If it didn't, the NSA would not be so interested in collecting it. Paranoid people who suspect they are being listened to are unlikely to reveal much in the conversation itself anyway, so in those cases the metadata matters more. Metadata can also reveal anomalous behavior, which they look for mainly because it's easy to find, but also because it can reveal important information, assuming the targets are correctly selected.
Anyway, the only reason they aren't collecting the calls themselves is because the storage required is not yet available, so their begging off about "but we aren't storing the content" is disingenuous. They can, at any moment, capture any call going over the long-distance network (which would include pretty much all cell phone calls) so the only thing they're unable to do is to retroactively listen in on calls. If you are flagged for whatever reason (you know someone who knows someone who knows someone they suspect of sending money to a bad charity), you may well be being monitored.
> If it didn't, the NSA would not be so interested in collecting it.
I'm going to quote Schneier.
Stop calling it metadata. Call it what cryptographers and security professionals have called it since forever: traffic analysis.
Traffic analysis is a powerful tool by itself. Combined with practically unlimited access to source material, and the ability to unmask almost every communication node, you don't even NEED to care about the contents. Timing, frequency and direction of communication are more than enough.
The storage required for all call recordings (GSM, VoIP, land-line) is already available. For example, Speex [1] can compress voice down to about 2 kbps. So storing everything at, say, 8 kbps, you can fit 916,259 hours (104 years) of voice on just one 3 TB disk.
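For anyone who wants to check that figure, here is a minimal sketch. It assumes "3 TB" means 3 × 2^40 bytes and "8 kbps" means 8000 bits per second, which is the combination that seems to reproduce the number quoted above; the function name is mine.

```python
def hours_per_disk(disk_bytes: int, bitrate_bps: int) -> float:
    """Hours of audio that fit on a disk at a constant bitrate."""
    bytes_per_second = bitrate_bps / 8
    return disk_bytes / bytes_per_second / 3600

# Assumption: 3 TB taken as 3 * 2**40 bytes, 8 kbps as 8000 bit/s.
DISK_BYTES = 3 * 2**40
hours = hours_per_disk(DISK_BYTES, 8000)
years = hours / (24 * 365)

print(f"{hours:,.0f} hours, about {years:.1f} years")
```

With a decimal terabyte (3 × 10^12 bytes) the result drops to roughly 833,000 hours, so the conclusion does not depend much on which definition you pick.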
Let's take the US. 317 million people, assuming they call for an average of 10 minutes (based on nothing whatsoever, btw), gives approx. (10 × 317 million) / 60 ≈ 53 million hours of phone conversations a day.
Given 916,259 hours = 3 TB, 53M / 916,259 ≈ 57, and 57 × 3 TB ≈ 172 TB of data /a day/. And that's just the US. Even if you adjust the average to just one minute a day, you're still looking at 17 TB / day, which should be sorta manageable I reckon.
But it's not just the US. Let's assume they want to track all voice communications globally, rounding it to an even 7 billion, ten minutes a day. I'm counting 3819 TB/day. That's a lotta 3 TB hard drives.
tl;dr: big data is big.
disclaimer: I suck at basic arithmetic, I probably made a miscalculation.
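The estimate above can be rechecked with a few lines. This is only a sketch under the same assumptions as the comment (10 minutes per person per day, 916,259 hours of 8 kbps audio per 3 TB disk); the function and parameter names are made up for illustration.

```python
HOURS_PER_DISK = 916_259  # hours of 8 kbps audio per 3 TB disk, as quoted above
DISK_TB = 3

def daily_tb(population: float, minutes_per_day: float) -> float:
    """Terabytes of call audio generated per day for a population."""
    hours = population * minutes_per_day / 60
    return hours / HOURS_PER_DISK * DISK_TB

us_10min = daily_tb(317e6, 10)
us_1min = daily_tb(317e6, 1)
world_10min = daily_tb(7e9, 10)

print(f"US, 10 min/day:    {us_10min:,.0f} TB/day")
print(f"US, 1 min/day:     {us_1min:,.1f} TB/day")
print(f"World, 10 min/day: {world_10min:,.0f} TB/day")
```

Without rounding the disk count down, the US figure lands a hair above the 172 TB quoted, and the global figure matches the 3819 TB/day estimate to within a terabyte.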
~200TB/day is entirely within the room the NSA's budget allows.
You don't need to record all phones in the world if you have the metadata of all calls in the world. The NSA or another spy agency could record only calls that match a given pattern: place from which the call originates, who is being called, time of call, previous calls made or received by the line, when the line or phone was purchased, whether this phone's call patterns resemble another phone's call patterns, etc.
They wouldn't achieve 100% coverage, but efficacy would probably be 99%+.
I would guess the problem with this kind of semi-targeted collection is processing power to decide who is a target and schedule the line taps.
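To make the idea concrete, here is a toy sketch of metadata-driven selection. Everything in it is hypothetical: the record fields loosely follow the list above, and the rules and the watchlisted number are invented for illustration, not a description of any real system.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CallRecord:
    # Hypothetical metadata fields, loosely following the list above.
    origin_region: str
    callee: str
    hour_of_day: int
    line_age_days: int

Rule = Callable[[CallRecord], bool]

# Made-up rules, purely illustrative.
RULES: List[Rule] = [
    lambda c: c.callee == "+000-watchlisted",              # who is being called
    lambda c: c.line_age_days < 7 and c.hour_of_day < 5,   # new line, odd hours
]

def should_record(call: CallRecord) -> bool:
    """Record the call content only if some metadata rule matches."""
    return any(rule(call) for rule in RULES)
```

The scheduling problem mentioned above shows up here as the cost of evaluating every rule against every call setup in real time.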
Or record everything you can into a daily or weekly cache, then move the things that are Statistically Interesting into indefinite (expensive) storage, since keeping everything forever is outside current budget/capabilities.
Yes, good point. You don't have to store everything forever, you can have tiers of interestingness — capture everything, read it looking for certain patterns, store anything that matches those patterns forever, store stuff that matches [secondary, less important but still useful pattern] for a few years, and store all other calls for a few months or a year.
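A minimal sketch of such a tiering policy, assuming each captured call has already been tagged by some pattern matcher. The field names and the concrete retention periods (3 years, 180 days) are placeholders I picked to stand in for "a few years" and "a few months".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CapturedCall:
    # Hypothetical flags set by an upstream pattern matcher.
    matches_primary: bool
    matches_secondary: bool

def retention_days(call: CapturedCall) -> Optional[int]:
    """Return how long to keep a call; None means keep forever."""
    if call.matches_primary:
        return None        # top tier: store indefinitely
    if call.matches_secondary:
        return 3 * 365     # middle tier: "a few years" (placeholder)
    return 180             # everything else: "a few months" (placeholder)
```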
Phone calls are quite low quality audio, but I don't expect the NSA to be limited to consumer grade speech-to-text technology, so at least for calls in some languages, they could store the transcripts forever.
EDIT: Apart from processing power, another expensive problem with such a setup is memory to store the firehose temporarily.
EDIT 2: If you were wondering, 200TB/day would run at $7500 for 50 4TB external hard drives at $150 each, assuming you wanted to use a Backblaze-like setup. In a year, that's $2.7 million. (This doesn't account for redundancy.)
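The cost figure can be reproduced like this, using only the numbers from the comment (200 TB/day, 4 TB drives at $150 each, no redundancy); the variable names are mine.

```python
import math

TB_PER_DAY = 200
DRIVE_TB = 4
DRIVE_PRICE_USD = 150

drives_per_day = math.ceil(TB_PER_DAY / DRIVE_TB)
cost_per_day = drives_per_day * DRIVE_PRICE_USD
cost_per_year = cost_per_day * 365

print(f"{drives_per_day} drives/day, ${cost_per_day:,}/day, ${cost_per_year:,}/year")
```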
Well, for starters, if most calls are internal to the US, half of those ten minutes will be shared with another phone (unless you're counting only outgoing calls, not incoming ones, for the "magic" 10-minute number). So if you're talking about average talk time, you need to divide by two. The effect of conference calls is probably negligible, however.
Now, let's take a number, say 4 PB/day, or 120 PB/month. Say we put each call into S3 and each call lasts 2 minutes on average, so 5 × 317 × 10^6 × 30 ≈ 47.5 billion puts/month (about 1.6 billion a day), and let's say 1.5 million gets, and a full 120 PB of in/out internal transfer. That comes to just about 19M USD/month on S3. That's certainly within the NSA's budget, and Amazon can provide that with a profit margin (assuming they can, in fact, provide that service).
I think you made an arithmetic mistake. Look at it another way:
10 minutes per person per day * 8kbps = 600 kB per person per day.
600 kB / person / day * 365 days / year = 214 MB (0.209 GB) / person / year, converting with factors of 1024.
That's nothing. Consumer flash media is something like $0.30/GB. Let's mark that up 100x because the three letter agency doesn't care about costs and has an inefficient procurement process, so $30/GB.
0.209 GB / year * $30 / GB = $6 per person per year.
There are 300 million people in the US, but phone calls are between at least two parties, so:
300 million people * 0.5 * $6 per person per year = $900 million per year.
You can't even build a mile of highway for that little. Hell, some big cities in the US have a bigger annual deficit than that.
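Here is the same back-of-the-envelope calculation as a sketch, so the rounding is visible. It assumes 8 kbps means 1000 bytes per second and converts kB to MB to GB with factors of 1024, which appears to be how the 214 MB and 0.209 GB figures were obtained; the $30/GB price and the 0.5 factor come straight from the comment.

```python
SECONDS_PER_DAY = 10 * 60        # 10 minutes of calls per person per day
BYTES_PER_SECOND = 8000 // 8     # 8 kbps
PRICE_PER_GB_USD = 30            # consumer flash marked up 100x
PEOPLE = 300e6
SHARED_FACTOR = 0.5              # each call has at least two parties

kb_per_day = SECONDS_PER_DAY * BYTES_PER_SECOND / 1000   # 600 kB
mb_per_year = kb_per_day * 365 / 1024
gb_per_year = mb_per_year / 1024

cost_per_person = gb_per_year * PRICE_PER_GB_USD
total_per_year = PEOPLE * SHARED_FACTOR * cost_per_person

print(f"{mb_per_year:.0f} MB/year, ${cost_per_person:.2f}/person, "
      f"${total_per_year / 1e6:.0f} million/year")
```

Unrounded, the total comes out slightly above the $900 million quoted, because the per-person cost is a bit over $6.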
Count me among those who think the storage is doable. The transmission may be a bottleneck: effectively you're doubling the bandwidth requirements for phone traffic.
With local tapping and storage facilities, plus some cache-and-forward mechanism (including the enduring favorite, a station wagon full of tapes, or perhaps a panel van full of flash drives), this remains within the realm of possibility.
You can actually do much better than 2 kbps. MELPe (unfortunately patented) can do 600 bps, and is fairly robust to noise, I might add. With the advent of great speech-to-text ML, you could reduce this even further, to the 100-200 bps range.
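To put those bitrates side by side, here is a small sketch of yearly storage per person at each rate mentioned in the thread, assuming the thread's 10 minutes of calls per day. The 150 bps entry is just my midpoint of the 100-200 bps range.

```python
SECONDS_PER_DAY = 10 * 60  # 10 minutes of calls per person per day

def bytes_per_year(bitrate_bps: int) -> int:
    """Bytes of audio per person per year at a constant bitrate."""
    return SECONDS_PER_DAY * bitrate_bps // 8 * 365

# Bitrates mentioned in the thread; 150 bps is an assumed midpoint.
for label, bps in [("8 kbps", 8000), ("2 kbps", 2000),
                   ("600 bps", 600), ("~150 bps", 150)]:
    print(f"{label:>9}: {bytes_per_year(bps) / 1e6:7.1f} MB/person/year")
```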
I'll quote the slip-up of Tim Clemente, a former FBI agent:
> All digital communications are uh uh... There's a way to look at digital communications in the past. And I can't go into detail of how that's done or what's done but I can tell you that no digital communication is secure.
And even if they are not storing the audio recordings, wouldn't it be reasonable to expect that automatic transcription is run against the audio for search indexing and long-term storage purposes? Text transcriptions like that, even if they are imperfect, can preserve conversations with minimal storage requirements almost indefinitely.
Are you quite sure they aren't collecting content? The storage is feasible, and several remarks (slips?) by people in a position to know suggest that they are in fact doing this, if not to everyone, then to hundreds of thousands or even millions of people.
> Are you quite sure they aren't collecting content?
It seems to depend entirely on the program in question, and the legal authorities for each program.
For foreign intelligence actually collected overseas under EO 12333 they have programs that collect content (e.g. SMS), which have existed in one form or another since the beginnings of the Cold War.
For foreign intelligence conducted on U.S. soil they have to fall within the boundaries of Fourth Amendment reasonable search (and related legal rulings like Smith v. Maryland) so they wouldn't capture content. But on the other hand they can still capture content in targeted fashion against non-U.S. persons in certain scenarios, and that content might itself involve conversations involving a U.S. person and still be a legal search.
> It seems to depend entirely on the program in question
Accessibility does, but it has nothing to do with the agency. "Does the government store the content of US calls?" can be answered in the aggregate without reference to specific programs. If the content is there, an agency only has to acquire the authorities required to access what is already there.
So, the question remains: is the content there to access, given proper authorities?
There are multiple levels of authority, though. Collecting the calls at all would itself require some legal authority. Searching the calls you collected would then need yet another legal authority.
> Anyway, the only reason they aren't collecting the calls themselves is because the storage required is not yet available, so their begging off about "but we aren't storing the content" is disingenuous. They can, at any moment, capture any call going over the long-distance network (which would include pretty much all cell phone calls) so the only thing they're unable to do is to retroactively listen in on calls. If you are flagged for whatever reason (you know someone who knows someone who knows someone they suspect of sending money to a bad charity), you may well be being monitored.