Where are the tweets? I asked myself. Why can’t I search them now, now! After all, I know there are interesting tweets out there that are no longer searchable through Twitter’s search engine. They are, no doubt, in the data sent to the LOC, as is every other tweet sent since the beginning of Twitter.
I wasn’t the only one wondering, it seems. Audrey Watters at the O’Reilly Radar blog posted an update yesterday, explaining that making the entire archive of tweets available to researchers is no easy matter and could take a long time.
You can read Watters’ post for the details. Suffice it to say, an individual tweet is much more complicated than I thought, and the LOC has billions of them to index.
Each tweet is a JSON file, containing an immense amount of metadata in addition to the contents of the tweet itself: date and time, number of followers, account creation date, geodata, and so on. To add another layer of complexity, many tweets contain shortened URLs, and the Library of Congress is in discussions with many of the URL-shortening providers as well as with the Internet Archive and its 301works project to help resolve and map the links.
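To make that concrete, here is a rough Python sketch of what reading one such record might involve. The sample tweet is made up, and the field names (`created_at`, `followers_count`, `geo`, and so on) are my own assumptions for illustration, not the exact format the LOC receives. The function name `summarize_tweet` is likewise hypothetical.

```python
import json

# A made-up sample record. The field names are illustrative assumptions,
# not the actual layout of the data sent to the Library of Congress.
SAMPLE_TWEET = """
{
  "text": "Reading about the Twitter archive http://example.com/abc",
  "created_at": "2010-06-03T14:05:00Z",
  "geo": null,
  "user": {
    "screen_name": "example_user",
    "followers_count": 120,
    "created_at": "2008-01-15T09:30:00Z"
  }
}
"""


def summarize_tweet(raw):
    """Pull the tweet text and a few metadata fields out of one JSON record."""
    tweet = json.loads(raw)
    user = tweet.get("user", {})
    text = tweet.get("text", "")
    return {
        "text": text,
        "sent_at": tweet.get("created_at"),
        "followers": user.get("followers_count"),
        "account_created": user.get("created_at"),
        "has_geodata": tweet.get("geo") is not None,
        # Shortened links still have to be resolved separately;
        # here we only pick them out of the text.
        "links": [word for word in text.split()
                  if word.startswith(("http://", "https://"))],
    }


summary = summarize_tweet(SAMPLE_TWEET)
print(summary)
```

Even this toy version has to deal with nested user data, missing geodata, and links that point somewhere else entirely. Multiply that by billions of tweets and the indexing problem starts to look a lot less trivial.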
More on the architecture of a single tweet:
Digital archivists told Watters that building the infrastructure to index the Twitter archive could take quite a while, and — here’s the disappointing kicker — even when they have that method in place and are indexing tweets like mad, the archive will likely only be available to “known researchers” who will need LOC approval to access the archive.