data, data, data: extracting something useful from the Big “mess”

Listening to the radio this morning on my usual bike ride to work, I couldn’t believe I had missed the announcement that Wikileaks released a query-able database of the so-called “Kissinger Cables” yesterday! Come on Twitterers, I rely on you! I guess everyone was busy reacting to the Mendeley takeover by Elsevier (see Mendeley’s take & Elsevier’s statement).

We could have a long discussion about Wikileaks politics, the implications of releasing and organizing such a database, and so forth. If you want to go down that road, I suggest you check out yesterday’s Democracy Now segment, which interviews two Wikileakers and discusses the implications of creating such a database. NOTE: these ~1.7 million U.S. diplomatic and intelligence documents from 1973 to 1976, aka “The Kissinger Cables”, were already publicly available via the National Archives.

Instead, as a social scientist working with Big Data, I was intrigued by how databases like this not only allow journalists to strategically dig through the documents – with stories popping up all over the world (just do a news search on “The Kissinger Cables”) – but also by how the organization of such text data holds great potential for research. This potential includes: a) the actual technical techniques used to organize such a massive dataset of text documents; b) given these boundaries, how we can operationalize the data – can we think of it in network terms and identify relations not only through statements but through shared event presence (two-mode/bipartite affiliations of individuals’ and entities’ involvements), and/or apply other text analysis techniques that Humanities scholars work with; and c) of course, how these insights contribute to our understanding of the history and social phenomena of this specific period. This query-able dataset of a massive number of text files for public use is a sign not only that the digitization of such documents is an asset to public knowledge, but also that a methodological framework needs to be developed for each unique dataset, given its characteristics – only then is the potential for knowledge in multiple domains truly unlocked.
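To make point (b) a bit more concrete, here is a minimal sketch of the two-mode/bipartite idea using Python’s networkx library: individuals are linked to the documents they appear in, and projecting onto the individuals yields a one-mode network of people connected through shared document presence. The names and cable IDs below are entirely made up for illustration – this is a sketch of the technique, not an analysis of the actual data.

```python
# Sketch: build a two-mode (person <-> document) affiliation network,
# then project it onto individuals linked by shared document presence.
# All names and cable IDs are hypothetical.
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical affiliation data: person -> cables they are mentioned in
mentions = {
    "diplomat_a": ["cable_001", "cable_002"],
    "diplomat_b": ["cable_002", "cable_003"],
    "diplomat_c": ["cable_003"],
}

B = nx.Graph()
for person, cables in mentions.items():
    B.add_node(person, bipartite=0)        # mode 1: individuals
    for cable in cables:
        B.add_node(cable, bipartite=1)     # mode 2: documents
        B.add_edge(person, cable)          # affiliation: mentioned in

# Project onto the individuals: an edge means co-presence in at least
# one cable, with the edge weight counting the shared cables.
people = [n for n, d in B.nodes(data=True) if d["bipartite"] == 0]
co_presence = bipartite.weighted_projected_graph(B, people)

for u, v, d in sorted(co_presence.edges(data=True)):
    print(u, "--", v, "shared cables:", d["weight"])
```

Here diplomat_a and diplomat_b end up connected (both appear in cable_002), as do diplomat_b and diplomat_c (cable_003), while a and c remain unlinked – exactly the kind of relation that never appears as an explicit statement in any single document.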

Let’s get digging! 🙂


open data everywhere

Reading through my tweets this afternoon – during a break from re-reading Karin D. Knorr-Cetina‘s “The Manufacture of Knowledge” (a nice read about the sociology of science), which is helping me conceptualize some processes in science for the first draft of the first chapter of my dissertation – I saw this news: eIFL and SPARC, two organizations working to advocate change in different research processes, released a statement supporting the Slovenian Ministry of Higher Education, Science and Technology’s proposal to create “a national open data and open publication infrastructure and mandatory deposition of publicly funded data and publications”. What great news!

Open data is popping up everywhere 🙂. What is so interesting about open data? Check out this video from the World Wide Web Foundation about open data. Then check out these links (not an exhaustive list, just some samples collected from tweets by my colleagues at the VU Amsterdam – Frank van Harmelen (@FrankVanHarmele) and Paul Groth (@pgroth) – to give you an idea of what is happening in regard to open data):

Open Knowledge Foundation’s European-level data registry-  and some technical info about it.

UK’s open data on government spending –

Europeana making metadata available –

The NYTimes also has linked open data –

India’s also going “open”.

Might also want to check out: What is data linking? Check out a short intro here.