About juliembirkholz

PhD Candidate @ Network Institute, Vrije Universiteit Amsterdam, researching dynamics of collaboration in science using computational social models

“Your data is more important than your paper!”

A delayed report on the OpenAIRE Workshop at UGent on 18 November. OpenAIRE (https://www.openaire.eu/) is an initiative supported by the EC to foster, develop and explore open access in science. It has national reps in all EU member states, as well as in a number of non-EU states, who are responsible for disseminating info to researchers and helping them navigate open access (OA) in general. The large majority of the workshop was recorded, and the recordings can be found via the OpenAIRE Workshop page.

The workshop, among other things, presented the OA pilot project connected to Horizon 2020 funding. It is targeted at a number of exact science fields, but all Horizon grant holders can opt in. It mandates and financially supports 3 open access publications (max 2000 euros/publication), with the obligation of also preparing a data management plan and putting the data (or part of it) in a data repository if legally able. No, this does not mean that you have to share your data, nor that you give it to anyone and everyone, but just that there is a record and metadata around it and that it is not lying in your sock drawer. For example, as far as I understand, patient data would be exempt from this, but survey data would/could be stored there unless there was an explicit privacy/legal issue.

Lots of talk then about the data management plan (thinking about your data and writing it up before, during and after the research), and the benefits and issues around OA. It was also interesting to see how different unis and science systems are now responding to these calls from the bottom and from the top (funders).

The keynote by Lennart Martens of UGent biology was filled with great anecdotes of data management and OA success stories. He emphasized that “Your data is more important than your paper!”, and closed with (paraphrasing) “Dragons always get lots of gold and then sleep on it, why bother getting the gold (the data) and go do something original with the gold. The metaphor is imperfect, as this pile of gold is finite, but data is not. When I use the data I do not use it up. Data is an infinite treasure.”


My punch line from the day, particularly for skeptical social science researchers: these records of data, and not the papers or articles, are the fundamental lifeblood of a field. For example, natural history libraries, where you have collections of natural specimens (bugs, animals, plants), are the result of data management & OA, and future findings can thus be attributed to this open science policy and to having good records of the data with which to compare specimens. Without such initiatives these specimens would remain in the back closets of researchers, gathering dust.

What if we as social science researchers think about our research data in that way? I am not suggesting that we give, for example, field work notes to just anyone, and I am not comparing the natural sciences to the social sciences; obviously we have very different issues of privacy, confidentiality, security, etc. that don’t exist, or not to such a great extent, in other fields. But what if failing to leave even a record of this data through metadata (e.g. notes about our key concepts, population examined, collection strategies/techniques, etc.), outside of the box of notes we may carry with us or the raw data on USB sticks or external drives, and outside of what gets published, is a disservice to the pursuit of science that we as researchers are responsible for?

A “living data management plan”, as Marjan Grootveld so rightly put it, should certainly be part of the standard business talk in developing a project. Data management is currently largely seen as a sensitive topic, yet it is after all core to our business.



Higher Education in the EP

This past year has had me experimenting with different types of open data in seeking to say something about policy and policy-making, and specifically about higher education. One part of this ongoing work, with colleagues Jelena Brankovic and Martina Vukasovic, is the use of the Talk of Europe linked data set. Talk of Europe is linked data of the plenary debates held in the European Parliament (EP) between July 1999 and January 2014, as well as related metadata in English.

As part of this project I partook in the Talk of Europe – Travelling CLARIN Campus Creative Camp #2 in March, which sought to bring together researchers interested in social science and humanities driven questions using this data set. It was a great week of tinkering with queries and interacting with other researchers working on a plethora of research questions (see the review of the Camp). More importantly, Astrid van Aggelen and Jan Wielemaker gave great assistance in developing a unique query that fit the needs of our research focus. The query allowed us to identify speeches through different key words and to retrieve the related data about each speech: date, speaker, speaker party affiliation, speaker country, etc. This allowed us to address the following:

Higher education is increasingly promoted as a policy solution to various other policy problems. The development of EU higher education policies, and linkages between higher education and other sectors in EU legislative processes are not necessarily apparent due to complex governance arrangements between the different EU institutions and the lack of data that would allow a systematic analysis. The Talk of Europe dataset offers an avenue to systematically explore how higher education is proposed in non-higher education agenda topics considering the role of a number of factors: parliamentary committees, individual members of the EP, different political parties, coalitions, and nation-states in explaining the prominence of higher education. Through a mixed-methods analysis we seek to identify patterns in the relations of higher education and non-higher education agenda topics. This research is relevant not only to the field of higher education but also in understanding the role of EP in policymaking. (adapted from our submitted abstract to Creative Camp #2 for building a SPARQL query).

Since the camp, we presented our first results at the Consortium of Higher Education Researchers (CHER) conference in Lisbon, Portugal, in September 2015. There we explored: a) the frequency of speeches on the topic of HE, b) the frequency of speeches that use an HE term but are not specifically on the topic of HE, and c) how these patterns are related to the party and country affiliations of the speaker. This provided descriptive insights into how HE is discussed in the EP, where we found distinct patterns from which to develop a set of propositions. More on these specific findings once we get this paper submitted to an outlet this fall.


As the lead data scientist on this project I thought I would share some of the challenges we encountered, which I hope speak to some common problems that social science and policy researchers w/ limited formal technical training face in using open data and big(ger) data:

a) how do we efficiently code 10,000+ speeches? We took an approach that sought to balance distant and close reading techniques: manual coding of the speech titles, integrated with some automatic techniques for identifying key words in the titles, rather than categorizing the full text of each speech.

b) which platform/environment to explore and analyze this data? The shared knowledge within the team of social science and policy researchers made SPSS a first choice for data analysis. But given the limited processing capabilities of that environment, I found myself going back and forth within a pipeline of tools that were suitable for the discipline: storing in csv/Excel, managing/ordering/coding in R, some basic stats in SPSS & R, and back again.

c) working with bigger data. I say bigger data as this was a large structured dump of data, unlike big data, which is largely unstructured or constantly changing. Given this structure and uniformity, we were able to implement methods whose reliability and validity we felt we could evaluate.
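To make the title coding in a) a bit more concrete, here is a minimal Python sketch of automatic key word flagging. It is not the procedure we actually ran: the key word list, the record layout and the example titles are all made up for illustration. Titles that contain a key word are flagged automatically, and the rest are set aside for manual coding.

```python
import re

# Hypothetical key word list; the project's real coding scheme is not shown here.
HE_KEYWORDS = ["higher education", "university", "universities", "erasmus"]

# One case-insensitive pattern with word boundaries around each key word.
HE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def flag_titles(speeches):
    """Split speeches into key word hits and titles left for manual coding."""
    auto_coded, needs_manual = [], []
    for speech in speeches:
        hits = sorted({m.lower() for m in HE_PATTERN.findall(speech["title"])})
        if hits:
            auto_coded.append({**speech, "he_terms": hits})
        else:
            needs_manual.append(speech)
    return auto_coded, needs_manual


# Made-up records, for illustration only.
speeches = [
    {"id": 1, "title": "Erasmus Mundus programme (2009-2013)"},
    {"id": 2, "title": "Common fisheries policy"},
    {"id": 3, "title": "Role of universities in the Lisbon Strategy"},
]

auto_coded, needs_manual = flag_titles(speeches)
print([s["id"] for s in auto_coded])    # speeches flagged by key word
print([s["id"] for s in needs_manual])  # speeches left for manual coding
```

Working on titles only keeps the manual check feasible: a coder can skim the flagged titles for false hits far faster than the full text of each speech.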

PUNCHLINE: Big(ger) data has great potential for policy research, although its size and the technical hurdles in collecting and in some cases analyzing it require a set of expertise that is not necessarily in the traditional toolbox of social science and policy researchers = collaborate to innovate.

& Creative Camps are awesome!

2:AM conference and Altmetrics workshop

Report on the 2:AM conference and Altmetrics workshop; the entire 2:AM conference was live streamed and can now be viewed via YouTube by session. Altmetrics are the so-called alternative metrics to science (see the short manifesto), and both events discussed them. The 2:AM conference was a mix of practitioners (= publishers) and researchers. The workshop was an elevator pitch style conference geared at researchers, with breakout groups that led to a set of draft papers on different topics = you brainstorm with a set of researchers on a topic that emerged during the conference and draft a paper together, so numerous potentially publishable papers emerged from these sessions.

Dominating this year’s conference was measuring impact through social media (e.g. spread of info through media, in policy papers, on Twitter; measuring “value”/citations through downloads, likes and retweets vs. formal citations). These talks asked how we define impact for science, whether it is different for researchers and for institutions, and what funding bodies or other science orgs should consider when evaluating impact. Some notable talks on this topic in particular:

  1. The conference session on impact with abstracts & see the video.
  2. The session on altmetrics in research evaluation & see the video.
  3. From the workshop: Kim Holmberg’s group (full-disclosure, also a co-author of mine) on measuring impact in science (scroll down to project and see the description of the newly funded project on measuring impact in science).
  4. Cameron Neylon’s talk on developing theory about how social media is used in science and by scientists as research evaluation metrics; via Twitter he provided slides from a different but relevant talk.
  5. Lots of cool stuff coming out of Altmetric, as expected, scattered throughout the event; check out the schedule.

PUNCHLINE: Even though there are few standards on how altmetrics are being/will be used in research evaluation, or on their impact and effect, as an “altmetrician” I would encourage researchers to bring their “non-traditional” outputs to the table. Mention these in conversations with department heads and group leaders, and for departments as well as in formal evaluations. An increasing number of people, institutions and the like are online, so be open about how you are disseminating your work to the public. In addition, these social media tools have proved to be great networking tools, both for maintaining a network and for identifying interesting researchers, papers, etc.

Explicit self-promotion – a twit pic of me presenting our work on the responses of higher education institutions in the UK, where we found exploratory evidence that less research-intensive universities have more diverse organizational responses (departments, policy papers, events, library training, etc.) than more research-intensive ones. We postulate that these unis may have more to gain in the way of positioning themselves to prove impact. More to be seen as we take the next steps in this project to explore these responses.

Altmetrics, oh yeah social media…

I have been lying low on social media, working this past year to position myself in the field of HE through reading, networking and experimenting with “my ways” of producing knowledge tidbits. But last week I got a friendly little reminder to get back online in a formal way: I had the pleasure of attending and presenting at the 2:AM Conference and Altmetrics Workshop (see my post on the conference). The last dark years of the PhD:) = writing with blinders on to finish, have not allowed me to engage with this community as much as I would have liked; but regardless, I remain inspired. Following the conference I was able to reflect upon this attraction and identify how and why altmetrics keeps the blood flowing in the academic chamber of my heart, at least.

Altmetrics touches on a number of my professional and personal interests. Of course the discussion of new/emerging scholarly outputs is likely exciting for any academic. The call for open access is everywhere, and the actual practices that we undertake to do research (reading, collecting data, analysis, but also networking, sharing info, just stimulating your brain!) are intimate to our lives as researchers. They are changing, in that they are increasingly online or supported through online tools, cloud services and platforms. But the topic of altmetrics encompasses more specific interests of mine: a) networking in general (e.g. how can we use our networks to yield different kinds of capital, how do those networks influence our access, etc.), b) science and technology in general (e.g. changing practices of knowledge production and dissemination), and c) innovation (e.g. what do we need to come to new ideas, how do we innovate). In addition, as a self-confessed data-addict, who gets giddy with the idea of any big(ger) data on social actors (individuals and institutions alike), altmetrics could just really be a girl’s dream date: data being generated with every click… so long story short, I am back online sharing with whomever wants to listen and be part of a growing conversation on networking for innovation, or networking for knowledge, or just networking… we will figure it out along the way.

Networks and Higher Education

Since November 2014 I have a post doctoral research post at CHEGG where I have the opportunity to research the networks of higher education institutions to understand both the antecedents and consequences of the networks that these institutions employ in implementing higher education tasks and achieving institutional goals.

Check out CHEGG and the diverse work of our research institute, here. And stay tuned for more posts about this new research venture.

Computer Science in The Netherlands Survey

This special blog post relates to a survey sent out to Dutch Computer Scientists as part of my PhD research in cooperation with the Rathenau Institute.

In my PhD research I investigate the dynamics of social networks. I specifically study one social system of social beings, scientists, to understand the dynamics of knowledge systems. Scientists were selected as a population of study as there are rich publicly accessible data on scientists, ranging from publication databases to web profiles and metadata from multiple Web sources. These data sources provide a rich set of information for studying the dynamics of a specific system.

My work up to now has solely used publication (bibliometric) data and available metadata to study the dynamics of social systems [1]. In an effort to provide more detailed insight into these dynamics, a survey was recently prepared asking a set of scientists, from multiple universities, positions, and levels of experience, to provide additional data for this study. In this project one field in one national setting is investigated: Computer Science in The Netherlands. It is this set of scientists that has received an invitation to partake in a survey to collect further data on the field.

Why Dutch computer scientists? The Dutch context was selected because it is a typical European research environment with funding on multiple levels for different types of research. Notably, the Dutch research environment also has a diversity of universities, from research to vocational, at which to examine different processes in a relatively small geographical space.

Why research Computer Science? The field of Computer Science was selected for three reasons: the traditions of the field with multiple sub-fields within the discipline; and the known tendency for collaboration through co-authorship; as well as the validity and reliability of online sources documenting publications.

Computer Science is a field based on both information and computation studies, coming together in the use of computers as systems and/or tools for solving research problems. The discipline of Computer Science is a mature, intellectually unified field with a number of mature sub-fields existing as self-sustaining practices. Computer Science subjects range from bioinformatics, artificial intelligence/cognitive science and cybernetics to quantum computing and business applications. Consequently the field not only works on internal questions but also has a tendency to work with other fields. Within the Netherlands the discipline of Computer Science is a field of high research quality (Nationale Informaticakamer 2010).

The field of Computer Science has a number of publication databases that are internally managed. These databases allow us to make a valid selection of publications from our sample population, compared to the use of Web of Science, which has acknowledged biases for this field [2]. In this study we use DataBase systems and Logic Programming (DBLP, see [3]) to acquire publication data. DBLP is a server that provides bibliographic information on major Computer Science journals and conference proceedings.

Why this survey? As mentioned in the introduction, my work up to now has solely used publicly available data about scientists. I aim to combine this with information about a specific set of scientists to provide greater insights into network dynamics. This survey is in part traditional, asking for information about position and gender to explain different dynamics. But it also has a social network component, asking scientists to reflect on their co-authors during a period of five years. Social network surveys require the identification of specific individuals in order to collect information on the relationships between actors [4]. The data will in no way be used for any kind of evaluation. The answers are confidential and the data are treated anonymously. Scientists are asked about individual co-authors as compiled from DBLP, but these data will be aggregated for analysis and will never be connected to individual names. Neither will the presentation of the results relate to your name or your co-authors’ names. Access to the collected data is only available to the PhD researcher.

What insight can be drawn from such a study? The addition of richer data, not available on the Web, allows more concrete insights into dynamics, which contribute not only to scientific knowledge but also to practical knowledge for scientists themselves, of any field, and for policymakers. It allows the further extension of models that I have been using in my PhD research, to contribute to knowledge on the conditions under which social networks evolve.

[1] Birkholz, J.M., Bakhshi, R., Harige, R., van Steen, M., & Groenewegen, P. (2012). Scalable Analysis for Large Social Networks: the data-aware mean-field approach. Social Informatics, 406-419, see: http://arxiv.org/abs/1209.6615.

[2] Bar-Ilan, J. (2010). Web of Science with the Conference Proceedings Citation Indexes: the case of computer science. Scientometrics, 83(3), 809-824.

[3] DBLP: http://www.informatik.uni-trier.de/

[4] Wasserman, S., & Faust, K. (1994). Social network analysis: Methods and applications (Vol. 8). Cambridge University Press.

data, data, data, extracting something useful from the Big “mess”

Listening to the radio this morning, on my usual bike ride to work, I couldn’t believe I had missed the announcement of the release of a query-able database of the so-called “The Kissinger Cables” by Wikileaks yesterday! Come on Twitteraars, I rely on you! I guess everyone was busy reacting to the Mendeley takeover by Elsevier (see Mendeley’s take & Elsevier’s statement).

We could have a long discussion about Wikileaks politics, the implications of releasing and organizing such a database, and so forth. If you want to talk about that, I suggest you check out yesterday’s Democracy Now segment, with interviews with two Wikileakers on the implications of creating such a database. NOTE: these ~1.7 million U.S. diplomatic and intelligence documents from 1973 to 1976, aka “The Kissinger Cables”, were already publicly available via the National Archives.

Instead, as a social scientist working with Big data, I was intrigued by how databases such as these not only allow journalists to strategically dig through the documents, with stories popping up all over the world (just do a news search on “The Kissinger Cables”), but also how the organization of such text data has great potential for research. This potential includes a) the actual technical methods used to organize this massive data set of text documents, b) given these boundaries, how we can operationalize the data, meaning can we think of it in network terms and identify relations not only through statements but also through shared event presence (two-mode/bipartite affiliations of individuals or entities with the events they were involved in) and/or other text analysis techniques that Humanities scholars work with, and c) of course how these insights contribute to the understanding of history and social phenomena of this specific period. This query-able dataset, which opens a massive amount of text files to public use, is a sign not only that the digitization of such documents is an asset to public knowledge, but also that a methodological framework needs to be developed for each unique dataset given its characteristics; only then is the potential for knowledge in multiple domains truly unlocked.
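For readers less familiar with two-mode data, here is a minimal Python sketch of the idea in b). The names and meetings are invented, and real affiliations would first have to be extracted from the documents themselves. The sketch simply turns person-event affiliations into person-person ties, weighted by the number of events two people share.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (person, event) affiliations, for illustration only.
affiliations = [
    ("Person A", "Meeting 1"),
    ("Person B", "Meeting 1"),
    ("Person C", "Meeting 1"),
    ("Person A", "Meeting 2"),
    ("Person B", "Meeting 2"),
]


def project_onto_people(pairs):
    """One-mode projection: tie weight = number of events two people share."""
    attendees = defaultdict(set)
    for person, event in pairs:
        attendees[event].add(person)

    weights = defaultdict(int)
    for people in attendees.values():
        # Every pair of people present at the same event gets one more shared event.
        for a, b in combinations(sorted(people), 2):
            weights[(a, b)] += 1
    return dict(weights)


ties = project_onto_people(affiliations)
for (a, b), weight in sorted(ties.items()):
    print(a, "-", b, "shared events:", weight)
```

The projection throws away which events were shared, so for many questions it is better to keep the two-mode structure and analyze it directly.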

Let’s get digging? 🙂

research in the “eHumanities” needs social and computer scientists

Last week I attended “Get going: The Nijmegen Spring School in eHumanities”. The school focused on three programs and/or skill sets: Python, R and Gephi. I thoroughly enjoyed getting my hands dirty with Python, and can’t believe I have lived my scientific life, up ’til now, without it! I would recommend learning this language to all social scientists and humanities scholars, particularly those working with Big data, as it is a fairly straightforward language with an increasing number of online tutorials. Stop doing everything manually, or within the boundaries of Excel, and learn a tool that will speed up processes such as variable recoding!
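As a small taste of what I mean by variable recoding, here is a minimal Python sketch. The categories and the mapping are made up for illustration and are not taken from the Spring School materials. It collapses free-text job titles into three groups in a single pass.

```python
# Hypothetical recoding scheme: collapse detailed job titles into three groups.
RECODE = {
    "phd candidate": "junior",
    "postdoc": "junior",
    "assistant professor": "mid",
    "associate professor": "mid",
    "full professor": "senior",
}


def recode(values, mapping, default="other"):
    """Recode raw values, ignoring case and surrounding whitespace."""
    return [mapping.get(value.strip().lower(), default) for value in values]


# Made-up raw survey answers with inconsistent spelling and spacing.
raw_positions = ["PhD candidate", " Postdoc", "Full Professor", "Lecturer"]
recoded = recode(raw_positions, RECODE)
print(recoded)
```

The same few lines work for four rows or four million, which is exactly where doing it by hand in Excel stops being an option.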

Beyond learning these skills I also learned a lot from participants about the Humanities, as most attendees came from the Humanities, with the exception of a few social scientists and computer scientists. I myself am a trained social scientist and had little knowledge about the Humanities and the giant push for so-called eHumanities/digital Humanities and the like. This increase in funding and interest within the academic community seems to be coming from two sources: the increasing digitization of sources used in the Humanities, combined with the lack of training (of course there are exceptions) in how to conceptualize, operationalize, and analyze such data. Let me make this clear: I am NOT saying that pre-digital or non-digital work in the Humanities is/was fruitless, but rather that the field is challenged to formulate a new methodology, and thus a new skill set, for researchers. This became apparent within this small group of mainly Humanities scholars during the workshop: few had experience in statistics, in operationalizing data into network terms, or in thinking in such terms/schemas. I am not criticizing attendees, as this was the goal of the workshop after all, but I was struck, as someone versed in these techniques/knowledge, by how helpful these techniques could be and thus by the truly great need to fill this knowledge gap.

Often a way to bridge this is to bring in Computer Scientists, experts in automating everything, organizing data, analyzing large data sets and modelling problems; on paper certainly the most logical step to aid Humanities scholars. But after this three day workshop I see a missing piece that I think may be essential to bridge this gap even further, and that is integrating social scientists as well. Of course, you will probably say this is self promotion, as it has certainly sparked my interest, but hear me out. Social Science as a field has a long tradition of discerning valid and reliable methodologies for analyzing all sorts of data types, origins, sample sizes and the like. There is a strong tradition among quantitative social scientists of being trained in various statistical methods that allow the questioning of causal mechanisms (relationships between sets of variables). Social Science as a field is also faced with the increasing availability of Big data, and social scientists are thus also teaming up with computer scientists to expand applications. It seems quite obvious to me that there is a need for the three disciplines to address this e-fying of the Humanities together, to redefine the boundaries by looking at different (multidisciplinary) research questions, as well as at frameworks for integrating methodologies for using such data.

These discussions also challenged me to think about how I think about my data and thus my research questions. Although I actively attempt to expand the reach of my disciplinary blinders through work with computer scientists in particular, I certainly now see the advantages of considering a combined Humanities approach; particularly brainstorming about, and thus exploring, the increasing amount of data produced with the Humanities in mind.

Any suggestions about where to start?

Large Social Network Dynamics Talk + Sunbelt Presentation

If you are curious about new approaches to studying large social network dynamics, check out the confirmed upcoming talk at the International Network for Social Network Analysis Conference (Sunbelt) in Hamburg, Germany. I will be giving a talk, on behalf of my co-authors (see below), on the “Methodological Specifications for Application of the Mean-field Model for Large Scale Social Networks” in the section- Large Scale Networks Analysis on Thursday afternoon 23 May 2013. In this talk I will discuss a model we have developed for overcoming a number of limitations in presently used models to investigate large social network dynamics.

For further details, as they become available check out – the conference webpage: http://hamburg-sunbelt2013.org/

Submission Summary:

Title: Methodological specifications for application of the mean-field model for large scale social networks
Author(s): Birkholz, Julie M. (1); Lungeanu, Alina (2); Bakhshi, Rena (3); Groenewegen, Peter (1); van Steen, Maarten (3); Contractor, Noshir (2)
Institute(s): (1) Vrije Universiteit Amsterdam, Network Institute, Organization Sciences, Amsterdam, Netherlands; (2) Northwestern University, Evanston, IL, United States; (3) Vrije Universiteit Amsterdam, Network Institute, Computer Science, Amsterdam, Netherlands

Text: The statistical modeling of the emergence of social networks is most commonly undertaken using two models: Stochastic actor-oriented models (using SIENA) and p*/ERGM. However, both models have scaling limitations due to computational challenges. We propose the use of a mean-field approach to study large scale social network dynamics (1000s of nodes). A mean-field model, originating from physics, enables consideration of a large number of nodes through the aggregation of classes of nodes into “nodal buckets.” The analysis then computes the interactions/communication between buckets. The mean-field model has been successfully applied to estimate attribute and network parameters on large social networks (Birkholz et al 2012). Here the nodes were aggregated into buckets based on shared attributes.

However, the inferences from such models hinge crucially on the selection of the shared attribute used for classification of nodes into buckets. To overcome this limitation, we propose a methodological specification (based on equivalence classes) for the aggregation of nodes into buckets. We apply this technique to study the co-authorship network of 1,354 researchers in the Oncofertility scientific field over a four year period. We estimate the extent to which collaboration networks are influenced by multiple factors such as cosmopolitanism, visibility, scientific age, and institutional affiliation.
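For readers who want a feel for what aggregating nodes into “nodal buckets” means in practice, here is a minimal Python sketch. It covers only the aggregation step, not the mean-field estimation itself, and the node names, attribute values and edges are all made up. Nodes are grouped by one shared attribute, and ties are then counted within and between buckets.

```python
from collections import Counter

# Made-up node attribute (institution) and co-authorship edges, for illustration only.
institution = {"n1": "VU", "n2": "VU", "n3": "NU", "n4": "NU", "n5": "NU"}
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4"), ("n4", "n5")]


def bucket_sizes(bucket_of):
    """Number of nodes that fall into each bucket."""
    return Counter(bucket_of.values())


def bucket_tie_counts(edge_list, bucket_of):
    """Count undirected ties within and between buckets."""
    counts = Counter()
    for u, v in edge_list:
        # Sort the bucket pair so (VU, NU) and (NU, VU) are counted together.
        pair = tuple(sorted((bucket_of[u], bucket_of[v])))
        counts[pair] += 1
    return counts


sizes = bucket_sizes(institution)
tie_counts = bucket_tie_counts(edges, institution)
print(dict(sizes))
print(dict(tie_counts))
```

Swapping the `institution` dictionary for another attribute changes the buckets and therefore the counts, which is the sensitivity to attribute choice that the abstract points to.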