"Microsoft logged everyone's MSN IM conversations for 6 months"
This title is sensationalist and inaccurate. The article says that the dataset consists of one month of logging:
"'The compressed dataset occupies 4.5 terabytes, composed from 1 billion conversations per day (150 gigabytes) over one month of logging,' according to the researchers."
"We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system."
How anonymous can your own voice be? Not very much, I don't think.
A checkbox on a survey can be anonymous, but my communication with other people would be impossible to record "anonymously." Remember all that flak Netflix got for releasing its "anonymous" records of people's movie-watching habits?
Omitting my name doesn't make it morally unobjectionable to release the information without my consent, yet a lot of huge companies have been doing this more and more often.
You shouldn't expect privacy when sending information to someone else's server.
Google records your searches, pg monitors voting, and TiVo knows what you're watching on TV.
The data associated with a user is valuable in improving the product and makes for interesting research. If you want anonymity, you have to explicitly try to be discreet.
When I use MSN, I'm sending data to the other person through the Microsoft tunnel; if they clone the data while it's going through the tunnel and save it for their own use, that's kind of a breach of my trust.
I don't expect them to NOT do this, but I still believe it's uncool.
Also, there's a huge difference between recording my personal conversations and recording my movies, searches, or voting habits.
Why not? I can have a personal conversation with someone else verbally over the air, but I can't have a personal conversation with someone else over bits?
You're not having a personal conversation with someone else over bits; you're having that conversation through an intermediary, specifically the hardware of your respective ISPs and any other third parties the packets are routed through. That isn't so much a personal conversation as a very accurate game of Telephone.
As an aside, I apologize for not responding sooner; I did not know how to find my old posts.
All of our data was anonymized; we had no access to personally identifiable information. Also, we had no access to text of the messages exchanged or any other information that could be used to uniquely identify users.
That's enough for anonymized IDs, IPs, timestamps of start/finish (and probably individual messages in a many-message conversation) -- but not full transcripts or recorded voices.
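One way to read the comment above is as back-of-the-envelope arithmetic on the figures quoted from the article. A quick sketch, assuming a 30-day month and decimal units (neither is stated in the quote):

```python
# Figures quoted in the article; the 30-day month and decimal
# gigabytes/terabytes are assumptions made for this estimate.
CONVERSATIONS_PER_DAY = 1_000_000_000
BYTES_PER_DAY = 150 * 10**9      # 150 GB per day, compressed
DAYS_LOGGED = 30                 # assumed length of "one month"

# Average compressed size of a single logged conversation.
bytes_per_conversation = BYTES_PER_DAY / CONVERSATIONS_PER_DAY

# Total size over the logging period, to compare with the quoted 4.5 TB.
total_terabytes = BYTES_PER_DAY * DAYS_LOGGED / 10**12

print(f"{bytes_per_conversation:.0f} bytes per conversation")
print(f"{total_terabytes:.1f} TB over {DAYS_LOGGED} days")
```

Around 150 compressed bytes per conversation leaves room for a handful of IDs, addresses, and timestamps, which fits the researchers' statement that no message text was available.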
The inflection of the k-core graph at 20 is interesting. Rather than a fundamental property of human nature though, I'd speculate that this is the direct result of school class sizes.
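For anyone unfamiliar with the term, the k-core of a graph is the largest subgraph in which every node has at least k neighbours that are also in the subgraph. A minimal sketch of the usual peeling approach follows; the function name `k_core` and the toy graph are made up for illustration, and this is not necessarily how the researchers computed their plot:

```python
def k_core(adjacency, k):
    """Return the set of nodes in the k-core of an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    removed = set()
    # Start with every node that already has fewer than k neighbours.
    queue = [node for node, d in degree.items() if d < k]
    while queue:
        node = queue.pop()
        if node in removed:
            continue
        removed.add(node)
        # Removing a node lowers the degree of its remaining neighbours,
        # which may push them below k as well.
        for nbr in adjacency[node]:
            if nbr not in removed:
                degree[nbr] -= 1
                if degree[nbr] < k:
                    queue.append(nbr)
    return set(adjacency) - removed


# Toy graph: four friends who all talk to each other, plus one
# acquaintance "e" who only talks to "a".
graph = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
}

core3 = k_core(graph, 3)
core4 = k_core(graph, 4)
```

Here `core3` keeps the four mutual friends and drops "e", while `core4` is empty. Plotting the size of the k-core against k for the whole network gives the kind of curve the comment refers to.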