Tag Archives: data

Digging through what Twitter knows about me

I joined Twitter on February 21, 2007, at exactly 15:14:48, and I created my account via the web interface. As you can see, my first tweet was pretty mundane!

I remember discussing this exciting cool “new Web 2.0 site” with Kim Plowright @mildlydiverting in Roo’s office in Hursley a couple of days before, and before long he, Ian and I were all trying this new newness out. It was just before the 2007 SXSWi, where Twitter really started to get on the radar of the geekerati.

But wait a moment! The API only lets you pull back just over the last 3,000 tweets, so how was I able to reach all the way back five years and display that tweet, when I've got over 33,000 of them to my name?

It’s a relatively little-known fact that you can ask Twitter to disclose everything they hold associated with your account – and they will (at least in certain jurisdictions; I’m not sure whether they will do this for every single user, but in the EU they are legally bound to). I learned about this recently after reading Anne Helmond’s blog entry on the subject, and decided to follow the process through. I first contacted Twitter on April 24, and a few days later faxed (!) them my identity documentation, most of which was “redacted” by me 🙂 Yesterday, May 11, a very large zip file arrived via email.

I say very large, but actually it was smaller than the information dump that Anne received. Her tweets were delivered as 50MB of files, but mine came in nearer to 9MB zipped – 17MB unzipped. I’d expected a gigantic amount of data in relation to my tweets, but it seems as though they have recently revised their process and now only provide the basic metadata about each one rather than a whole JSON dump.

So, what do you get for your trouble? Here’s the list of contents, as outlined by Twitter’s legal department in their email to me.

– USERNAME-account.txt: Basic information about your Twitter account.
– USERNAME-email-address-history.txt: Any records of changes of the email address on file for your Twitter account.
– USERNAME-tweets.txt: Tweets of your Twitter account.
– USERNAME-favorites.txt: Favorites of your Twitter account.
– USERNAME-dms.txt: Direct messages of your Twitter account.
– USERNAME-contacts.txt: Any contacts imported by your Twitter account.
– USERNAME-following.txt: Accounts followed by your Twitter account.
– USERNAME-followers.txt: Accounts that follow your Twitter account.
– USERNAME-lists_created.txt: Any lists created by your Twitter account.
– USERNAME-lists_subscribed.txt: Any lists subscribed to by your Twitter account.
– USERNAME-lists-member.txt: Any public lists that include your Twitter account.
– USERNAME-saved-searches.txt: Any searches saved by your Twitter account.
– USERNAME-ip.txt: Logins to your Twitter account and associated IP addresses.
– USERNAME-devices.txt: Any records of a mobile device that you registered to your Twitter account.
– USERNAME-facebook-connected.txt: Any records of a Facebook account connected to your Twitter account.
– USERNAME-screen-name-changes.txt: Any records of changes to your Twitter username.
– USERNAME-media.zip: Images uploaded using Twitter’s photo hosting service (attached only if your account has such images).
– other-sources.txt: Links and authenticated API calls that provide information about your Twitter account in real time.

Of these, let’s dig a bit more deeply into just a few of the items – there’s no need to pick everything to pieces.

The “tracking data” is contained in andypiper-devices.txt and andypiper-ipaudit.txt – interesting. The devices file essentially contains information on my phone, presumably for the SMS feature: they know my number and my carrier. The IP address list goes back to the start of March, so they have two months of data on which IPs have been used to access my account. I’ve yet to subject that to much scrutiny to check where those addresses are located – that’s another script I need to write.
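Twitter doesn’t document the layout of the ipaudit file anywhere I’ve seen, but since IPv4 addresses have such a distinctive shape, a first-pass script doesn’t really need to care about the exact format. Here’s a rough sketch of the kind of thing I have in mind – the sample lines are entirely hypothetical, not taken from my real file:

```python
import re
from collections import Counter

def ip_counts(audit_text):
    """Count how many times each IPv4 address appears in the audit text."""
    ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", audit_text)
    return Counter(ips)

# Hypothetical sample lines (the real file's format may differ):
sample = """2012-05-01 09:12:11 81.2.69.142
2012-05-01 18:40:03 81.2.69.142
2012-05-02 08:05:59 192.0.2.44
"""
print(ip_counts(sample).most_common())
```

From there, feeding the distinct addresses into a geolocation lookup would show where the logins came from.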

I took a look at andypiper-contacts.txt and was astonished to find out how much of my contact data Twitter’s friend finder and mobile apps had slurped up. I don’t even have all of this in my address book… Given that the information included the sender email addresses for various online retailer newsletters, I’m guessing that Google’s API (I’m a Gmail user) coughed up not just my defined contact list, but also the email address of everyone I’d ever heard from, ever.

Fortunately, there’s a way to remove this information permanently, which Anne has written about. I went ahead and did that, and then Twitter warned me that the Who To Follow suggestions might not be so relevant. That’s OK because I don’t use that feature anyway – and in practice, I’ve noticed no difference in the past 24 hours!

I use DMs a lot for quick communication, particularly with colleagues (it was a pretty reliable way of contacting @andysc when I needed him at IBM!). That’s reflected in the size of andypiper-dms.txt – which is also a scary reminder that I used to delete my DMs, but since Twitter now makes it harder to get to and delete them, I’ve stopped, and there’s a lot of private data I wish I’d scrubbed.

Taking a peek at the early tweets in andypiper-tweets.txt, I’m trying to remember when the @reply syntax was formalised and when Twitter themselves started creating links to the other person’s profile. Many of my early tweets refer to @roo and @epred, and I don’t think they ever went by those handles. Five years is a long time.

I mentioned that the format used to deliver the data appears to have changed since Anne made her request. She got a file containing a JSON dump of each tweet, including metadata like retweet information, in_reply_to, geo, and so on. By comparison, I now get simply the creation info, the status ID (the magic that lets you get back to the tweet via the web UI), and the text itself:

********************
user_id: 786491
created_at: Wed Feb 21 15:43:54 +0000 2007
created_via: web
status_id: 5623961
text: overheating in an office with no comfort cooling or aircon. About to drink water.

It’s a real shame that they have taken this approach, as it means the data is now far more cumbersome to parse and work with. Still, using some shell scripts I did some simple slicing-and-dicing, because I was curious how my use of Twitter had grown over time. Here’s a chart showing the number of tweets I posted per year (2012 is a “to date” figure, of course). Growth was slow initially, but last year I suddenly nearly doubled my output.
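The same tally could be done in a few lines of Python rather than shell: split the dump on the rows of asterisks that separate records, then pull the year off the end of each created_at line. A sketch (the first sample record is real, the second is invented for illustration):

```python
from collections import Counter

def tweets_per_year(dump_text):
    """Parse the flat dump (records separated by rows of 20 asterisks)
    and tally tweets by the year at the end of each created_at field."""
    years = Counter()
    for record in dump_text.split("*" * 20):
        for line in record.splitlines():
            if line.startswith("created_at:"):
                # e.g. "created_at: Wed Feb 21 15:43:54 +0000 2007" -> "2007"
                years[line.rsplit(" ", 1)[-1]] += 1
    return years

sample = """********************
user_id: 786491
created_at: Wed Feb 21 15:43:54 +0000 2007
created_via: web
status_id: 5623961
text: overheating in an office with no comfort cooling or aircon. About to drink water.
********************
user_id: 786491
created_at: Thu Mar 01 10:00:00 +0000 2007
created_via: web
text: another (invented) example tweet
"""
print(tweets_per_year(sample))
```

Running that over the real andypiper-tweets.txt gives the per-year counts behind the chart.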

I’m still considering what other analysis I’d like to do. I could chart the client applications I’ve used, or make a word cloud showing how my conversational topics have changed over time… now that all of the information is mine, that is. It’s just a shame I have to do so much manual munging of the output beforehand.
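The word cloud idea boils down to a word-frequency count over the tweet texts, minus the boring common words. A minimal sketch, assuming the texts have already been extracted into a list (the stopword set here is just an illustrative handful):

```python
import re
from collections import Counter

def word_frequencies(texts,
                     stopwords=frozenset({"the", "a", "to", "in", "of", "and"})):
    """Tally word frequencies across tweet texts - the raw input for a word cloud."""
    words = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                words[word] += 1
    return words

texts = [
    "overheating in an office with no comfort cooling or aircon. About to drink water.",
    "drinking more water in the office",  # invented second example
]
print(word_frequencies(texts).most_common(5))
```

Feed the result to any word cloud renderer, or bucket the texts by year first to see how the topics shift.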

Oh, and the email I received from Twitter Legal also said:

No records were found of any disclosure to law enforcement of information about your Twitter account.

So, that’s alright then…

Why did I do this? Firstly, because I believe in the Open Web and in owning my own data. Secondly, because I hope I’ll now be able to archive this personal history and make it searchable via a tool like ThinkUp (which I’ve been running for a while now, but not for the whole five years). Lastly… no, not “because I could”… well, OK, at least partly because I could… but mainly because I believe that companies like Twitter, Facebook and Google should be fully transparent with their users about the data they hold, and that going through this currently-slightly-painful procedure will encourage Twitter to put formal tools in place that give everyone this level of access, frictionlessly.

If you’ll excuse me, I’m off to dig around some more…

Postcodes should be free?


Something I picked up from a tweet recently (can’t remember who from) was the effort to create a free database of UK postcode data via a site called Free The Postcode!. For those who don’t know, UK postcodes are essentially the same as zip codes in the US – Wikipedia tells me that we’ve had them in this country for 50 years now, since 1959.

Note: in what is now the dim and distant past, I used to work for the Post Office’s IT division. I have no current association with the UK Post Office, and what follows are entirely my own random thoughts on the subject.

The essential thrust of Free The Postcode is this: the Post Office currently charges people for access to their database beyond a certain number of queries per day. [I think they used to send updated copies of PAF (the Postal Address File) on CD to companies every month or so – presumably there’s now some online mechanism for distribution, but I have no idea]. Much as Wikipedia has “freed” the world from having to buy hardback copies of Britannica, and OpenStreetMap is crowdsourcing a global map that is not bound by Ordnance Survey fees or Google’s control, wouldn’t it be great if we could do the same for postcode and address information in the UK?

Well… I guess. There are a few problems that I can see with the approach. The first is that only the Post Office can allocate, update and change postcodes in an area – and every now and then they have done this over wide areas (Southampton’s SO codes were all changed or reorganised within the last ten or fifteen years, I believe). The second is that in order to submit postcode information you need to know both your location (not so difficult these days, with GPS built into an increasing number of mobile devices) and the postcode you are currently in. Unless you are at home or at the office, that is potentially a bit more tricky – so actually building this free database could take a very long time. Also, in order to draw accurate boundaries, you will need a lot more than a single reading from each postcode area.
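To make the single-reading problem concrete: with multiple submissions per postcode, the crudest thing you can do is average the readings into a representative point per code – still nowhere near a boundary, but a start. A sketch with entirely hypothetical coordinates:

```python
from collections import defaultdict

def postcode_centroids(readings):
    """Estimate one representative point per postcode by averaging all
    submitted (lat, lon) readings - a crude centroid; drawing real
    boundaries would need many more readings and proper geometry."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for postcode, lat, lon in readings:
        entry = sums[postcode]
        entry[0] += lat
        entry[1] += lon
        entry[2] += 1
    return {pc: (lat_sum / n, lon_sum / n)
            for pc, (lat_sum, lon_sum, n) in sums.items()}

# Hypothetical readings: (postcode, latitude, longitude)
readings = [
    ("SO21 2JN", 51.0270, -1.3985),
    ("SO21 2JN", 51.0274, -1.3991),
    ("SW1A 1AA", 51.5010, -0.1416),
]
print(postcode_centroids(readings))
```

Even this toy version shows why coverage matters: one reading per postcode just gives you back the single, possibly inaccurate, point.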

In the interests of experimentation, and my doubts notwithstanding, I thought I’d give this a try. There’s a free iPhone application called iFreeThePostcode which works out where you are and then allows you to submit your location and the postcode online (by the way, there are also Android applications, or a web form).


A couple of interesting points here. Firstly, I found it fascinating to see how long it took my phone to get a location lock better than 50m accuracy – it started off at over 1000m and gradually narrowed itself down (jumping back up over 300m on a reasonably regular basis). The other thing is that I had to fill in my email address in order to go through a validation process – they send you an email with a “confirm that you submitted this” link, presumably to deter spammers. That’s fine, but as many of the reviews in the iTunes Store say, I’m handing over personal data here, and there’s no statement of how it might be used. To be fair, the current content of the database is available as public domain, but that doesn’t mean the people gathering the data don’t have other purposes.

Besides that, there are some interesting legal discussions on the associated wiki page, and no overall stated privacy policy for the project.

If you’re interested, by all means give the FreeThePostcode project a look. I can’t quite say whether or not I’m in favour of the idea – frankly, I think this is a tricky problem to solve through crowdsourcing.

Update: the source code of the iFreeThePostcode app is available.

Sharing large files – drop.io

In the past week I’ve had two separate conversations with people who wanted to know a way of posting large(ish) files on the web for temporary purposes, i.e. just to let a couple of people download something without going via email.

I don’t have a definitive answer, of course. The traditional way would be something like an FTP server. There’s Amazon S3 too.

The service I’m increasingly using is drop.io – a really simple way of temporarily sharing files of up to 100MB in size. There’s no sign-up or account required. You simply specify a drop name (so I could create a name of “andytemp” or similar, and it would end up with a URL of http://drop.io/andytemp), then specify a time limit of between 1 day and 1 year, after which the drop will be deleted. Then you can add as many files as you like, up to 100MB, to the drop. You can add a password for access if you like. You can specify whether other people can just download/view the files, or add their own. And that’s it.

If you decide to use the service, one hint I’d give is to set the “optional” admin password for your drop when prompted, as it means that you can go back in later and see how many people have downloaded files, as well as adjust the “self-destruct” time of the drop.

There are some other really cool features like the ability to have an RSS feed of the drop, get email alerts, post MP3 files via a phone number, fax documents into it… a bunch of things I’ve just not needed to play with yet… but it’s a nice service, and appears to work well.

(NB that drop http://drop.io/andytemp is live for the moment, and it is set to read/write, but in time it will delete itself. Have a play if you like…)

(Update: on reflection, I’ve made it read-only. I should have realised that read/write means anyone can upload anything, and I can’t vouch for whatever gets uploaded – a bit short-sighted of me. Ordinarily, of course, you’d only share the URL with folks you know.)