data, data, data, extracting something useful from the Big “mess”

Listening to the radio this morning on my usual bike ride to work, I couldn’t believe I had missed yesterday’s announcement of the release by Wikileaks of a query-able database of the so-called “Kissinger Cables”! Come on Twitteraars, I rely on you! I guess everyone was busy reacting to the takeover of Mendeley by Elsevier (see Mendeley’s take & Elsevier’s statement).

We could have a long discussion about Wikileaks politics, the implications of releasing and organizing such a database, and so forth. If you want to talk about that, I suggest you check out yesterday’s Democracy Now segment, with interviews with two Wikileakers about the implications of creating such a database. NOTE: these ~1.7 million U.S. diplomatic and intelligence documents from 1973 to 1976, aka “The Kissinger Cables”, were already publicly available via the National Archives.

Instead, as a social scientist who works with Big data, I was intrigued by how databases such as these not only allow journalists to strategically dig through the documents, with stories popping up all over the world (just do a news search on “The Kissinger Cables”), but also by how the organization of such text data has great potential for research. This potential includes a) the actual techniques used to organize this massive data set of text documents; b) given these boundaries, how we can operationalize the data, meaning: can we think of it in network terms and identify relations not only through statements but also through shared event presence (two-mode/bipartite affiliations of individuals or entities through their involvement in events), and/or through other text analysis techniques that Humanities scholars work with; and c) of course, how these insights contribute to the understanding of the history and social phenomena of this specific period. This query-able dataset, which opens a massive amount of text files to public use, is a sign not only that the digitization of such documents is an asset to public knowledge, but also that a methodological framework needs to be developed for each unique dataset given its characteristics. Only then is the potential for knowledge in multiple domains truly unlocked.
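To make the “shared event presence” idea a bit more concrete, here is a minimal Python sketch of how a two-mode (person-by-document) affiliation list could be turned into a one-mode network of co-presence. The document IDs and person names are invented placeholders, not taken from the cables, and the function name is my own; it only illustrates the general projection technique, not how the Wikileaks database is actually organized.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: which people appear in which document/event.
# All IDs and names are invented for illustration.
event_participants = {
    "cable_001": {"person_a", "person_b", "person_c"},
    "cable_002": {"person_a", "person_b"},
    "cable_003": {"person_c", "person_d"},
}


def project_to_one_mode(affiliations):
    """Count, for each pair of people, how many events they share."""
    weights = defaultdict(int)
    for participants in affiliations.values():
        # Sorting gives every pair a stable (a, b) key with a < b.
        for a, b in combinations(sorted(participants), 2):
            weights[(a, b)] += 1
    return dict(weights)


co_presence = project_to_one_mode(event_participants)
for (a, b), shared in sorted(co_presence.items()):
    print(a, b, shared)
```

The resulting weights (number of shared documents per pair) could then be loaded into a tool such as Gephi as a weighted edge list.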

Let’s get digging? 🙂


research in the “eHumanities” needs social and computer scientists

Last week I attended the “Get going: The Nijmegen Spring School in eHumanities”. The school focused on three programs and/or skill sets: Python, R and Gephi. I thoroughly enjoyed getting my hands dirty with Python, and can’t believe I have lived my scientific life, up ’til now, without it! I would recommend learning this language to all social scientists and humanities scholars, particularly those working with Big data, as it is a fairly straightforward language with an increasing number of online tutorials. Stop doing everything manually, or within the boundaries of Excel, and learn a tool that will speed up processes such as variable recoding.
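As a small taste of what I mean by variable recoding, here is a minimal sketch in plain Python. The survey answers and the codebook are invented for illustration, and `recode` is just a helper name I made up; real data would of course need its own codebook.

```python
# Hypothetical raw survey answers; note the inconsistent case and spacing.
responses = ["strongly agree", "agree", "neutral", "Agree ", "disagree", "n/a"]

# Hypothetical codebook mapping text labels to a 1-5 Likert scale.
likert_codes = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def recode(values, codebook, missing=None):
    """Map raw text labels to numeric codes, normalising case and whitespace.

    Labels that are not in the codebook are returned as `missing`.
    """
    return [codebook.get(value.strip().lower(), missing) for value in values]


coded = recode(responses, likert_codes)
print(coded)
```

The same few lines work whether the list holds six answers or six hundred thousand, which is exactly the point.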

Beyond learning these skills, I also learned a lot from participants about the Humanities, as most attendees came from the Humanities, with the exception of a few social scientists and computer scientists. I myself am a trained social scientist and had little knowledge about the Humanities and the giant push for so-called eHumanities/digital Humanities and the like. This emergence of increased funding and interest within the academic community seemed to be coming from two sources: the increasing digitization of sources used in the Humanities, combined with the lack of training (of course there are exceptions) in how to conceptualize, operationalize, and analyze such data. Let me make this clear: I am NOT saying that pre-digital or non-digital work in the Humanities is/was fruitless, but rather that the field is challenged to formulate a new methodology and thus a new skill set for researchers. This became apparent within this small group of mainly Humanities scholars during the workshop: few had experience in statistics, in operationalizing data into network terms, or in thinking in such terms/schemas. I am not criticizing attendees, as this was the goal of the workshop after all, but I was struck, as someone versed in these techniques, by how helpful they could be and thus by the truly great need to fill this knowledge gap.

Often a way to bridge this is to bring in computer scientists, experts in automating everything, organizing data, analyzing large data sets, and modelling problems; on paper, certainly the most logical step to aid Humanities scholars. But after this three-day workshop I see a missing piece that I think may be essential to bridging this gap even further, and that is integrating social scientists as well. Of course, you will probably say this is self-promotion, as it has certainly sparked my interest, but hear me out. Social Science as a field has a long tradition of discerning valid and reliable methodologies for analyzing all sorts of data types, origins, sample sizes and the like. There is a strong tradition among quantitative social scientists of being trained in various statistical methods that allow the questioning of causal mechanisms (relationships between sets of variables). Social Science as a field is also faced with the increasing availability of Big data, and is thus also teaming up with computer scientists to expand applications. It seems quite obvious to me that the three disciplines need to address this e-fying of the Humanities together, redefining boundaries by looking at different (multidisciplinary) research questions, as well as developing frameworks for integrating methodologies for using such data.

These discussions also challenged me to think about how I think about my data and thus my research questions. Although I actively attempt to expand the reach of my disciplinary blinders, through work with computer scientists in particular, I certainly now see the advantages of considering a combined Humanities approach, particularly for brainstorming about and thus exploring the increasing amount of data produced with the Humanities in mind.

Any suggestions about where to start?

Large Social Network Dynamics Talk + Sunbelt Presentation

If you are curious about new approaches to studying large social network dynamics, check out the confirmed upcoming talk at the International Network for Social Network Analysis Conference (Sunbelt) in Hamburg, Germany. I will be giving a talk, on behalf of my co-authors (see below), on the “Methodological Specifications for Application of the Mean-field Model for Large Scale Social Networks” in the section- Large Scale Networks Analysis on Thursday afternoon 23 May 2013. In this talk I will discuss a model we have developed for overcoming a number of limitations in presently used models to investigate large social network dynamics.

For further details, as they become available, check out the conference webpage.

Submission Summary:

Title: Methodological specifications for application of the mean-field model for large scale social networks
Author(s): Birkholz, Julie M. (1); Lungeanu, Alina (2); Bakhshi, Rena (3); Groenewegen, Peter (1); van Steen, Maarten (3); Contractor, Noshir (2)
Institute(s): (1) Vrije Universiteit Amsterdam, Network Institute, Organization Sciences, Amsterdam, Netherlands; (2) Northwestern University, Evanston, IL, United States; (3) Vrije Universiteit Amsterdam, Network Institute, Computer Science, Amsterdam, Netherlands

Text: The statistical modeling of the emergence of social networks is most commonly undertaken using two models: stochastic actor-oriented models (using SIENA) and p*/ERGM. However, both models have scaling limitations due to computational challenges. We propose the use of a mean-field approach to study large scale social network dynamics (1000s of nodes). A mean-field model, originating from physics, enables consideration of a large number of nodes through the aggregation of classes of nodes into “nodal buckets.” The analysis then computes the interactions/communication between buckets. The mean-field model has been successfully applied to estimate attribute and network parameters on large social networks (Birkholz et al 2012). Here the nodes were aggregated into buckets based on shared attributes.

However, the inferences from such models hinge crucially on the selection of the shared attribute used for classification of nodes into buckets. To overcome this limitation, we propose a methodological specification (based on equivalence classes) for the aggregation of nodes into buckets. We apply this technique to study the co-authorship network of 1,354 researchers in the Oncofertility scientific field over a four-year period. We estimate the extent to which collaboration networks are influenced by multiple factors such as cosmopolitanism, visibility, scientific age, and institutional affiliation.
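For readers who want a feel for the “nodal buckets” idea, here is a minimal Python sketch of the aggregation step only: nodes are grouped by one shared attribute, and ties are then counted between buckets rather than between individual nodes. The nodes, attribute values, and ties are invented toy data, and the function name is hypothetical. This is not the mean-field model itself, nor the equivalence-class specification from the abstract; it only shows the kind of bucket-level summary such a model would work from.

```python
from collections import Counter

# Hypothetical nodes with one shared attribute (here: type of institution).
node_attribute = {
    "n1": "university",
    "n2": "university",
    "n3": "hospital",
    "n4": "hospital",
    "n5": "industry",
}

# Hypothetical undirected co-authorship ties between those nodes.
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4"), ("n4", "n5")]


def bucket_interactions(attributes, ties):
    """Aggregate node-level ties into counts of ties between buckets."""
    counts = Counter()
    for u, v in ties:
        # Sort the two bucket labels so (a, b) and (b, a) are the same key.
        pair = tuple(sorted((attributes[u], attributes[v])))
        counts[pair] += 1
    return counts


bucket_sizes = Counter(node_attribute.values())
between = bucket_interactions(node_attribute, edges)

print(dict(bucket_sizes))
print(dict(between))
```

Choosing a different attribute for `node_attribute` changes the buckets and therefore every count that follows, which is the sensitivity the abstract refers to.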


Talk at Northwestern, SONIC Lab December 2012

On Monday, Dec 17, 2012 I gave a talk at the SONIC lab at Northwestern’s Evanston Campus. The SONIC Lab (Science of Networks in Communities Lab) is a research group under the direction of Noshir Contractor that works on “social network theories, methods, and tools to better understand and meet the needs of diverse communities.”

I spoke about recent work by myself and my collaborators on the mean-field model for large social networks. It was a great opportunity to both present the model and get feedback from the SONIC group.

You can check out the talk here with slides and audio. For info about the talk see –

Studying Large Social Networks

In my research I investigate dynamics in large social networks, that is, networks of >1000 nodes. The most commonly used models have a number of limitations for studying the combined effects of both network and social parameters on network evolution; thus, a year ago I started working with Rena Bakhshi, a very talented researcher within the Network Institute. Rena had a model, the mean-field model, which she had used to investigate the dynamics of large communication networks, and she wanted to experiment with social network data. This resulted in our first application of the mean-field model to large social networks, entitled “Scalable Analysis for Large Social Networks: The Data-Aware Mean-Field Approach”, recently published in Social Informatics. See the abstract here.


Studies on social networks have proved that endogenous and exogenous factors influence dynamics. Two streams of modeling exist on explaining the dynamics of social networks: 1) models predicting links through network properties, and 2) models considering the effects of social attributes. In this interdisciplinary study we work to overcome a number of computational limitations within these current models. We employ a mean-field model which allows for the construction of a population-specific model informed from empirical research for predicting links from both network and social properties in large social networks. The model is tested on a population of conference coauthorship behavior, considering a number of parameters from available Web data. We address how large social networks can be modeled preserving both network and social parameters. We prove that the mean-field model, using a data-aware approach, allows us to overcome computational burdens and thus scalability issues in modeling large social networks in terms of both network and social parameters. Additionally, we confirm that large social networks evolve through both network and social-selection decisions; asserting that the dynamics of networks cannot singly be studied from a single perspective but must consider effects of social parameters.

how do new ideas come to rise?

The emergence of new ideas and knowledge from within a community has traditionally been studied as a social process. In recent work with colleagues Christine Moser, Dirk Deichmann, Iina Hellsten, & Shenghui Wang, we have investigated the potential dual role of the content space of ideas and the social network structure in which they emerge, to shed light on how the different structures relate to one another. This work, entitled “Exploring Ideation: Knowledge Development in Science Through the Lens of Semantic and Social Networks”, was presented at the 2013 Hawaii International Conference on System Sciences. Publication link coming soon!

Scientists’ use of the Web

Scientists are increasingly using the Web to exchange, share, and accumulate/identify knowledge, and the use of the Web by scientists is a field of growing interest. This made us ask: who is using these Web platforms? All scientists, or specific groups, ages, or disciplines of science? With a group of computer scientists within the Network Institute, we developed a method and tool to identify a set of known scientists, in order to be able to reflect on the representativeness of Web studies of scientists online. This work was recently presented at the Sixth Chinese Semantic Web Symposium (CSWS2012) and the First Chinese Web Science Conference (CWSC2012) in Shenzhen, China. It will be published shortly in the conference proceedings; for now you can find the publication here.