Evidence that the NSA Is Storing Voice Content, Not Just Metadata (schneier.com)
283 points by Libertatea on June 18, 2013 | 90 comments


The focus on content over meta-data is a red herring. What you say during the phone call is just that...something you say. The meta-data shows what you do.

A person browses the web at a doctor's office. Forty-five minutes later, that person calls their spouse, then an oncologist. A few days later, their spouse checks email from the oncologist's office. An hour later a call is made to a diagnostic center by the first person. The next day, they call their mother and a surgeon. [see interview with Susan Landau: http://www.democracynow.org/2013/6/12/more_intrusive_than_ea... ]

The conclusions which can be drawn from that meta-data are far more solid than can be drawn from listening in to any or all of the phone calls.

Your phone calls don't show what time you get up in the morning, or when you get to work or walk the dog to the park. The meta-data does.

The focus on the content of calls is 50 years out of date. The meta-data show the best time to burgle your house or fake your identity online. They show where you go and who you associate with. There's no need to know what you said.


I disagree. You give a situation which is suggestive of the person learning they have cancer, and is a perfect example of how mass tracking can expose personal information about our lives that we would rather keep private (particularly medical information, which I believe is the most vital kind of privacy threatened by these mass tracking schemes). But I wouldn't say that the conclusion you suggested is solid. It is equally plausible that the diagnostic center told the patient that results were negative, and the call to the mother was to reassure her and to the surgeon to cancel a preemptive appointment made by the doctor.

> The conclusions which can be drawn from that meta-data are far more solid than can be drawn from listening in to any or all of the phone calls.

This is just false. If you were a lawyer trying to draw a portrait of intentions with the metadata in court, obviously it would help your case if you could play the tapes and show what was said in each of the calls, so the other side can't propose equally valid situations like I did and create reasonable doubt.

I think we can be so fascinated by the power of having data, that much like the other powerful forensic tools used in the courtroom, we are eager to connect all the dots like we see TV shows like CSI doing. It can be difficult to remind people that more data (particularly meta-data and not content) produces more plausible hypotheses as often as it narrows them down.


> It is equally plausible that the diagnostic center told the patient that results were negative, and the call to the mother was to reassure her and to the surgeon to cancel a preemptive appointment made by the doctor.

The metadata would not end with that phone call - the pattern of continuing activities would make fairly evident what the prognosis had been. (Of course this doesn't meet the "beyond all reasonable doubt" standard of criminal court, but it meets the standard for justifying digging deeper into their affairs, or setting insurance rates for that matter.)


But the problem is that there is no court, and they (NSA or whomever) are left to make their own conclusions and act on them before we have a chance to defend ourselves. In the example, you can draw (at least) two conclusions, the problem is: the NSA gets to pick the one that suits their needs and defend it with misinterpreted "meta data"


... in secret court.


So true. Even Target can make simple meta-data conclusions: http://www.forbes.com/sites/kashmirhill/2012/02/16/how-targe...


While I see and agree with your point (you can derive a lot of information from just meta-data), this is where you go overboard:

> The conclusions which can be drawn from that meta-data are far more solid than can be drawn from listening in to any or all of the phone calls.

You cannot derive more information from meta-data than you can from the data itself. That's why we call it meta-data. "Honey, I may have cancer, I'm making an appointment with Dr. O. N'conlogist as soon as possible." is more compact, just as complete, and is more tolerant to missing parts.


It's also not necessarily correct or trustworthy. However, if you not only say "Honey, I may have cancer", and then go to the doctor and do research and go to the doctor again, and call your loved ones more often than you used to, and continue going to the doctor, now we know it's not just some conversational sarcasm or paranoia that the surveillance programs need to sift through, now we know you're serious, and you're worried, and you're dealing with it.


The value of meta-data is why people tag friends in photos and cameras geocode. It's called meta-data because it sits above data in the information hierarchy - just as a digital photograph sits above the bits and bytes on disk from which its image is composed.

"Honey, it's bad news," or "Darling, I need you to hold me," or "<sobbing>" don't say squat...unless you know they were made from the doctor's office and what calls were made afterwards.


This is missing the point that the meta-data is often less, not more, information. Now, if you have enough it can still be useful. But we are specifically talking about less, here. Giving someone more information is giving them more information. Period.

Consider, knowing that someone went to Walmart around noon doesn't say much. They could have done a lot. Knowing they specifically bought ____, is a different matter.

Hell, that is the entire point to this. You are having to aggregate lots of meta-data to determine the data of the call itself. How is this even an argument?


The data of the call isn't important. That's the point. What you say matters less than what you do. And you do a lot more than you say.

To put it another way, Google doesn't know that I am interested in a new iPod because I used Gmail to tell my friends, "I'm interested in a new iPod." They know it because I have been browsing pages offering iPods for sale. And that says a lot more about when I plan to purchase one than the statement sent via Gmail.

I didn't tell Kroger, "Hey, I've got a dog." I bought dogfood - and now that Witty is dead, they can tell from the change in my purchase habits that my big dog died - they don't print me coupons for Alpo anymore when I check out.

The content of the call indeed contains information beyond the meta-data about the call. But it's not just meta-data about voice calls that's being collected, it's meta-data about all modes of mobile and static communications - and geolocations when calls are not being made.


But even in your example the content of the pages is what's important, and that is data, not metadata. Metadata might be stuff like "the search for '$FOO' was done from IP address '$BAR' at time '$BAZ'". And you could even argue that $FOO itself is really data, not meta-data.

Likewise the contents of your receipt are data, not metadata. It's literally the bill of goods that authorizes you to take the named items from the store and (with your additional purchase creds) authorizes Kroger to charge your credit card $X.XX. Kroger may have inferred more than you wished, but they did it with the data.


One man's data is another man's meta-data.

My receipt is not the items I purchased. It is information about them. E.g.

    Power Systems: Conversations on Global
    Democratic Uprisings and the New 
    Challenges to U.S. Empire,                 $18.75
is not the text itself. It is information about something I did, not something I read or might have read or might read.

Likewise, my purchase of dogfood was not dogfood, nor my dog nor my feeding of it.

It's turtles all the way up.


'Information' is itself meta-data about the "meta-data" above. Which one is more useful? Can anyone tell what you did when given only meta-data about "information"?

Can you derive back to what you did, given 'information' as meta-data?


If it's turtles all the way up then it's turtles all the way down.

The Supreme Court has already ruled that metadata is not warrant-protected, so I'd be highly leery of equating data with metadata. (edit: spelling fix)


That you can determine things in multiple ways is a given. You can often infer the information of the calls. Often with decent accuracy. If you have that information, there is no need to infer it.

Take your Kroger example. You could have just stopped buying dog food from them. We completely stopped buying dog food from a store because we had trouble getting the dogs to take to the food without skin problems. Is it valid for them to assume that our dog is dead? Or just stick with the safe assumption that we aren't buying dog food from them anymore? :)

Now, I completely agree with the idea that collecting just the meta data is already pretty far reaching. I struggle to see why collecting it all is not even more far reaching.


It may be more far reaching. But the reach is gratuitous and drawing the line at the content of calls doesn't meaningfully curb the invasion of privacy.

If I stop pissing in the ocean, I'm not going to prevent rising sea levels.


On this point I can essentially agree. I just find the argument that the data that can be inferred is meaningless to be kind of odd.


To me, 'metadata' is a confusing word for this type of data. People in the ed-tech space have begun calling this type of data 'paradata' [1] because it sits beside the data.

This helps make it distinct from metadata. Metadata then describes the data itself (tags, classifications, etc.), whereas paradata describes the data's lifecycle and use (e.g. who used it, where, when, and how).

[1]: http://en.wikipedia.org/wiki/Paradata_%28Learning_Resource_A...


Metadata is not (necessarily) located above data - they are (usually) located next to each other and the metadata augments the data.


I have to disagree. One could probably extract most of the information in your example from one of the phone calls and very likely even more information you can not extract from metadata alone. But there is of course also information you can only extract from metadata because it is never explicitly verbalized.


[Beep] "Mom, call me."

It's the timing and individuals and the sequence of events which matter.


The information will probably be in the next call.


Yes, yes, but it is already in the meta-data: Even if the patient drives to Mom's house after getting voice mail - in which case there will not be the next call but there will be meta-data regarding the location of the patient's cell phone.


This is just not true - if I buy a used car and then call somebody to tell him that I just bought a black Audi A6 there is no (easy) way you can figure out these details from metadata alone.


How can there be meta-data about the location of the patient's cell phone, but no data from the patient's cell phone? Which "data" is that "meta-data" the "meta" of?


"It depends on what the meaning of the word 'is' is. -- Bill Clinton

And now I am reminiscing fondly about the days when our biggest national concern was a blowjob.


So meta-data of that would be "[Beep]"


I don't think people are worried about the NSA breaking into their house or faking their identity, nor do I think anyone is concerned that the NSA will hear them talking about how they have been diagnosed with cancer.

I think people are worried that they will say something well within their first amendment rights on a seemingly private channel, that will then get them marked as a "dissenter" of some kind.


It doesn't have to be the NSA that breaks into your house. It could be an outside analyst selling information to a guy who knows a guy etc.

I don't think anyone, be it government or corporate, should be capturing and storing detailed data like this for extended periods of time. It's just asking for trouble. I haven't seen any proof of a worst-case scenario more severe than ads being a bit less targeted if strict limits were placed on data retention and identification.


To reiterate your point, there has to be a strong chain of authority for the life of the storage. And they probably would completely replace 3 racks of storage with 1 every 3 years given increasing drive density. So that would be a lot of opportunity for a contractor to bolster their income by selling drives off on eBay.

http://web.mit.edu/newsoffice/2003/diskdrives.html


Here's how I'd do it: Everything gets chucked into a secure storage. Meta-data (from processing the data) goes into the data mining system (where it's probably shared with a lot of projects, somewhat securely). If the data mining indicates a "hot" message, the original message (and associated messages) will be sent to an analyst.

Can anyone think of a more likely scenario?

By "secure", I mean "not shared within the agency". Obviously, everything is locked down within the agency, but there'd be a limit to how much the staff would be allowed to see (except the sysadmins, like Snowden, who need wide-ranging access).


One party can obfuscate meta data by, for example, using a "burner phone" or by routing VOIP calls through 7 proxies.

The only way to obfuscate the content of a call is if both parties agree in advance to speak in code.

"Hi honey, the sparrow has left the nest. There may be a bee in the bathwater."

"John is that you? What the heck are you talking about? How did the cancer test go?"


At this point, I think that to some extent the concern about voice recording can be reduced to: "Fool me once, shame on you. Fool me twice, shame on me."

The public was lied to. At the least, a lie of omission.

The lying authority figures are now attempting to (re)establish their credibility -- or at least hang on to their authority and power.

Fool me twice?


I wouldn't be shocked to learn that their interpretation of "not listening in" does not include reading speech-to-text transcripts of conversations.


Agreed. While speaker-independent speech recognition has ever-so-slowly progressed to the point where it only mostly sucks, it's probably good enough to log a general transcript. Combine that with the temptation to store compressed audio at 1KB/sec, and 1.5GB/year would likely easily cover fully documenting every conversation you have without "somebody listening in." This assumes 2000 minutes/month, way more than I use, but I need to average in females... Divide that by two if you simply log per-conversation instead of per-phone and we're at 0.75GB/year/person.
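For anyone who wants to check the numbers, here is a rough sketch of that arithmetic. It uses only the figures quoted above (1KB/sec, 2000 minutes/month); the decimal unit convention (1 GB = 10^9 bytes) and the function name are my own assumptions.

```python
# Back-of-the-envelope check of the per-person storage estimate.
# Inputs taken from the comment: 1 KB of compressed audio per second,
# 2000 minutes of calls per month.
# Assumption: decimal units, i.e. 1 KB = 1000 bytes and 1 GB = 10**9 bytes.

BYTES_PER_SECOND = 1_000
MINUTES_PER_MONTH = 2_000
MONTHS_PER_YEAR = 12
BYTES_PER_GB = 10**9


def yearly_storage_gb(bytes_per_second, minutes_per_month):
    """Return GB of audio recorded per year at the given rate and usage."""
    seconds_per_year = minutes_per_month * 60 * MONTHS_PER_YEAR
    return bytes_per_second * seconds_per_year / BYTES_PER_GB


# Logging every call once per phone.
per_phone = yearly_storage_gb(BYTES_PER_SECOND, MINUTES_PER_MONTH)

# Logging each conversation once instead of once per participant.
per_conversation = per_phone / 2

print(f"per phone:        {per_phone:.2f} GB/year")
print(f"per conversation: {per_conversation:.2f} GB/year/person")
```

That works out to 1.44 GB/year per phone and 0.72 GB/year/person per conversation, so the 1.5GB and 0.75GB figures are the same numbers rounded up.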

I think we can safely assume that you have privacy only when speaking in-person, and that's if you're not under some sort of investigation. Otherwise, find a windy place outdoors (near a large body of water for dramatic effect) and mumble through your conversation even if you're just a Hipster preparing for next weekend's party.


I suspect it means "We don't have a guy listening to your line in real time."


Or even better: "We said we aren't listening to your phone calls. We didn't say we aren't listening to a lot of phones, just (probably) not yours."

Their word manipulations are sickening. I just wish all of this went to a trial already and to the Supreme Court. Let's see them lie through their teeth to the judges then, who will actually understand everything they're saying or avoiding saying (unlike most of the press, or people out there).


But even if they lost at the Supreme Court, wouldn't they just appeal to the Secret Supreme Court?


Anyone else getting this..?

Technical Details

        www.schneier.com uses an invalid security certificate.
The certificate expired on 18/06/13 11:55. The current time is 18/06/13 14:02.

(Error code: sec_error_expired_certificate)


Funny, I'm seeing a lot of threads about NSA violations which mention comparable "don't go to this site" warnings. I've seen more in the last few days than the prior several years.


I am assuming that everyone who got the SSL cert expiration warning is using HTTPSEverywhere by the EFF?


Yup, it's expired.


Another reminder of the false security of certificates, because no one pays attention to them. Here's a security blog, so I will speculate that many of its readers are security conscious users, yet most probably went to the site anyway despite the security warning of an expired certificate.


What's the threat model for this? Is the MITM going to subtly change Schneier's essay? Perhaps they're going to find out the lame password I use for stupid comment forms, which password (literally: it's a word) I've been using continuously for that purpose since 1994?


Or inject an exploit into the stream and compromise the system being used to view it.


I would not think that is much of a threat here in this scenario. The threat is really that Schneier is no longer who he said he is, since it has not been validated recently. That is, his certificate, purchased and verified through an authority, has reached the age where that authority no longer guarantees that he is the one holding it. As such, someone else could have taken over his person and begun acting maliciously.

Right?

Edit: So, my question "Right?" was a legitimate question. If I am wrong, I'd like to know how. Note that this is an expired, non revoked certificate scenario we are talking about. Meaning the identity was established before, and to nobody's knowledge has it been stolen. Simply now that identity has not been established for a long time.


If you put any faith into the CA system's verification process then you'd be correct.


But if you don't have faith in the CA system, then what is the additional concern over an expired cert?


So you never browse to sites that don't use TLS? That seems far more limiting than say, keeping your browser up to date and keeping Java turned off.


You have a point, I do think that a majority of the internet populace would not raise a flag about this, but....also please notice that the URL linked on this page is simply HTTP.

So, either you (and the others in this comment thread) manually changed the URL, or you have some browser extension that automatically switches to HTTPS, etc. I'm sure more of us would've caught the issue if indeed more of us were exposed to the issue in the first place.


using httpseverywhere so that explains it, thanks.


Well an ssl cert that just expired is still encrypting data in flight. Most everyone that has dealt with security knows that people sometimes miss renewing the certs. Expired cert doesn't mean the site is now malicious.


But it means that, if you accepted the insecure certificate, you might've accepted a MITM-crafted certificate.

It doesn't mean that it's malicious, but it doesn't mean it's not.


Yes.


Why would the NSA not be able to store all phone calls as audio? If we assume 30 minutes per person per day, compressed to 20kbps, stored at $0.12 per gigabyte per year (this is what you pay at Amazon), it costs about $60 million to store a year's worth of phone calls for the entire population of the US. The budget of the NSA is more than $10 billion, so that's less than 0.6% of its budget. That is entirely doable. Note that I have grossly overestimated the cost here by assuming that the price per gigabyte-year is what you pay Amazon, and 30 minutes per person per day is probably an overestimate, and the compression could be better too. The actual cost is probably closer to $10 million; less than one thousandth of the NSA's budget.
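A minimal sketch of that estimate, for anyone who wants to plug in their own numbers. The call time, bitrate, price, and budget are the figures from the comment; the US population of roughly 315 million is my assumption, as are the decimal units and the simplification that a full year of audio is billed for a full year.

```python
# Rough cost of storing a year of US phone audio, using the comment's figures.
# Assumptions NOT in the comment: population of ~315 million,
# 1 GB = 10**9 bytes, and the whole year's data billed for a whole year.

MINUTES_PER_PERSON_PER_DAY = 30
BITRATE_BITS_PER_SECOND = 20_000      # "20kbps"
PRICE_USD_PER_GB_YEAR = 0.12
NSA_BUDGET_USD = 10e9                 # "more than $10 billion"
US_POPULATION = 315_000_000           # assumption
BYTES_PER_GB = 10**9
DAYS_PER_YEAR = 365


def gb_per_person_per_year(minutes_per_day, bits_per_second):
    """Return GB of compressed audio one person generates in a year."""
    bytes_per_day = minutes_per_day * 60 * bits_per_second / 8
    return bytes_per_day * DAYS_PER_YEAR / BYTES_PER_GB


per_person_gb = gb_per_person_per_year(
    MINUTES_PER_PERSON_PER_DAY, BITRATE_BITS_PER_SECOND
)
total_gb = per_person_gb * US_POPULATION
cost_usd = total_gb * PRICE_USD_PER_GB_YEAR
budget_share = cost_usd / NSA_BUDGET_USD

print(f"per person: {per_person_gb:.2f} GB/year")
print(f"total:      {total_gb / 1e6:.0f} PB/year")
print(f"cost:       ${cost_usd / 1e6:.0f} million/year")
print(f"share of a $10 billion budget: {budget_share:.2%}")
```

With these inputs it comes to about 1.64 GB per person per year, roughly 517 PB in total, and about $62 million per year, which is in line with the $60 million figure and around 0.6% of a $10 billion budget.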


Also, Amazon Glacier is only $0.01 per GB.


In 1998 a colleague of mine received a DARPA request to develop a tape-based recording device that -- it was obvious from some quick calculations -- would be capable of archiving all voice communications going in and out of the country. We could think of no other application for that kind of technology, and assumed that the real client was the NSA.

At the time, storing such a quantity of data was completely infeasible within the physical space & budgetary constraints of the proposed program. We told them so, and the project went away. However, given that:

1.) They were trying to do this 15 years ago, and

2.) Both their budget and the technological state of the art have improved substantially since then, and

3.) The government these days seems happy to treat citizens' rights with the same general contempt as non-citizens' rights, given some creative re-definitions of terms and rubber-stamp lawyering...

...I would not be at all surprised to learn that a program like this was in fact well-established by this point.


The more I read about the subject the more I believe all this data is primarily used for commercial reasons.

What else is to gain from monitoring loads of targets without obvious security reasons?


1. To stop dissent.

More likely:

2. To award huge government contracts to the collectors and grow the institution at the same time.


The biggest component of espionage is always industrial espionage.


Exactly. Personally, I'd be playing the currency markets if I had access to all this data. I'm curious how many people who are monitored are politicians and financiers (bankers/investors). The potential to abuse other economic markets relative to your own is enormous.



But if you investigate something and a new suspect emerges, you want to hear and analyze his calls in the past, don't you? And how can you achieve this if not by collecting all phone calls?


> But if you investigate something and a new suspect emerges, you want to hear and analyze his calls in the past, don't you?

I do? If they are suspect in a crime for specific reasons, why are those specific reasons not enough leads to follow?

As Chomsky said, if you're so worried about terrorism, stop participating in it... as in, The Iraq "War" alone killed so many more people than 9/11, and nobody could even be arsed to look at the debris of the latter before destroying it. This is akin to a wolf saying the chickens need to tell him where they are at all times, to better protect them. Feel free to fall for that once, but then please join the rest of us in the real world.


Blackmailing. Determining who is likely to whistle blow.


Major media reported that the Boston bomber's wife's phone calls she made prior to the bombing were listened to, after the bombing.


NSA is testifying right now in front of congress BTW

Oh wait no, it's the Deputy Attorney General

He says the 4th amendment doesn't apply to phone records.

Footer says "NSA director to reveal terror plots stopped by surveillance"



> And, by the way, I hate the term "metadata." What's wrong with "traffic analysis," which is what we've always called that sort of thing?

This has also been bugging me. Metadata is a very general term, and it doesn't explain what the NSA claims it's doing (whether they're doing anything else is beside this particular point). Moreover, the use of such a general term seems like it's part of the propaganda, to make us less scared: "We're not collecting data, we're collecting meta-data." Well, it turns out that they are one and the same anyway.


This isn't evidence the NSA is storing voice content, this is Schneier saying, "One reason I used to discount the idea that the NSA was storing all phone conversations was that it'd be too much data but now I don't think it is."

Not new evidence.


Obama says [1] that if you're a "US person" (whatever that means), "The NSA is not listening to your phone calls."

Does storing them for later use count as listening?

[1] http://www.cbsnews.com/8301-250_162-57589732/obama-on-nsa-pr...


Not according to Clapper. I'm sure they're all on message with their phrasing. They are no doubt asserting that when you pick up a phone, someone is probably not listening to your phone call at that moment, as if we need assurance of that.


Lossy compression turns any data into metadata.


To query the database, you need to know the phone number's area code is inside the united states (...? like that's fucking difficult?), then you must get "a further review" to see if they are "just expressing their first amendment rights", and then one of 20 analysts and 2 managers must approve it.

Go check out all the things that are not protected by the first amendment. http://en.wikipedia.org/wiki/United_States_free_speech_excep...

If we held people accountable for all the false statements of fact they make, FOX News would have been off the air years ago and all their newscasters thrown in jail. Basically, millions of people could be subject to database queries, considering how many loopholes there are.

One example they cite using the 215 was the NSA provided a phone number to the FBI, the FBI served notice to the court to find out who the number belonged to, and they then arrested and convicted the guy for giving money to a foreign organization that the USA labels a terrorist organization. So don't do business with anyone who might know a group of freedom fighters.


Lots of debate here on what is more powerful, the data or the metadata. I think in the case of voice content, a case can be made that the metadata is more powerful, simply because it's already parsed into a quantitative, objective format that's relatively easy to analyze en masse and find "suspicious" patterns within the public. With voice content, I don't think our natural language processing chops are up to par to deal with the massive amount of data and connect the dots in meaningful ways.

Of course, if you're already targeting a specific individual, and you can get a human to listen to the voice content, the debate is academic - both kinds of data complement each other and are equally powerful. Metadata still acts as a better "gateway drug" for narrowing down individuals though.


CNN already broke this story in early May. http://www.youtube.com/watch?v=vt9kRLrmrjc&feature=share


It is considerably more credible when an accepted industry expert comments on it than when the mainstream media does.


Did you watch the interview? They have an FBI agent talking about it. Do you think he's not really an FBI agent?


Well, given that YouTube can store such vast amounts of video, the suggestion that the NSA is storing lots of voice doesn't seem like a massive stretch.


I think what this points out clearly, and Schneier has been preaching this for years, is that we need to encrypt everything going forward.


Speaking for myself, I'd like to encrypt the content of my communications as much as possible. But I can't control or encrypt the meta-data, such as numbers I dial, how long I talk, etc.


I agree, but good luck getting the average internet user to do so, and their refusal will prevent you from effectively doing so as well whenever you want to communicate with them.


My experience has been that commercially available speech-to-text technology doesn't work very well on non-US-accented English. Maybe fixing that tech will be one benefit of PRISM. You know, just like we thank the Apollo space program for Velcro and pens that can write upside down.


Yep! We've actually known this for years. https://www.eff.org/deeplinks/2010/03/wiring-big-brother-mac...


So it turns out we should have actually listened to Shia LaBeouf back then:

https://www.youtube.com/watch?v=dNRgP4FVDzA


How about the simple fact that you don't need multiple data centers with capacities measured in exabytes to store "metadata"?


Surely a compressed call recording is also 'metadata'?


security certificate expired!


Warning: this article does not present any evidence.

Schneier is a great cryptologist, I've read his books, I've carried them around, I'm a big fan of his work.

BUT.

He's seriously lagging behind in his coverage of this scandal, it's like he's just reposting what others have already said and often stating the obvious.

So the NSA has the capability to store voice? No, really? Like since 1940????

Sorry, Bruce, but this article is shit.



