The U.S. Library of Congress said today that it has finished putting in place a process for collecting a full, ongoing stream of tweets, and that it has begun work to archive and organize more than 170 billion of them.
Under an agreement struck between the government institution and Twitter in 2010, the microblogging company is providing the Library of Congress with a full stream of all public tweets, starting with 21 billion generated between 2006 and April 2010, and now supplemented with about 150 billion more posted since then.
In an announcement about the status of the project today, the library wrote:
"Twitter is a new kind of collection for the Library of Congress but an important one to its mission. As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications, and other sources routinely collected by research libraries.
"Though the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials' communications to tracking vaccination rates and predicting stock market activity."
The Library of Congress isn't entirely clear on how the ongoing archive will be used, but it has issued a white paper (PDF) outlining the project.
This project, of course, is different from Twitter's recently announced initiative to make every user's full tweet history available to them. That effort is under way, though only some users have been given access to date.
Interestingly, the Library of Congress reported in the white paper that its two full copies of the entire archive of 170 billion tweets comprise about 133 terabytes of data. Each tweet, the library wrote, contains about 50 accompanying metadata fields.
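For a rough sense of scale, those figures can be turned into an average storage footprint per tweet. The sketch below is a back-of-the-envelope estimate only: it assumes decimal terabytes (10^12 bytes) and that the two copies are the same size, neither of which is stated here, and the function name is purely illustrative.

```python
# Figures reported by the Library of Congress; the units are an assumption.
TOTAL_BYTES = 133 * 10**12  # about 133 terabytes, assuming 1 TB = 10^12 bytes
COPIES = 2                  # two full copies, assumed to be equal in size
TWEETS = 170 * 10**9        # about 170 billion tweets


def bytes_per_tweet(total_bytes=TOTAL_BYTES, copies=COPIES, tweets=TWEETS):
    """Average bytes stored per tweet in a single copy of the archive."""
    return total_bytes / copies / tweets


print(round(bytes_per_tweet()))  # prints 391
```

Under those assumptions, each tweet works out to roughly 390 bytes per copy, metadata included, which is only an average and says nothing about how any individual tweet is stored.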