Over the past year I have been experimenting with different types of open data to say something about policy and policy-making, specifically in higher education. One part of this ongoing work with colleagues Jelena Brankovic and Martina Vukasovic is the use of the Talk of Europe linked data set. Talk of Europe is linked data of the plenary debates held in the European Parliament (EP) between July 1999 and January 2014, as well as related metadata in English.
As part of this project I took part in the Talk of Europe – Travelling CLARIN Campus Creative Camp #2 in March, which sought to bring together researchers interested in using this data set for questions driven by the social sciences and humanities. It was a great week of tinkering with queries and interacting with other researchers working on a plethora of research questions (see the review of the Camp). More importantly, Astrid van Aggelen and Jan Wielemaker gave me great assistance in developing a query tailored to the needs of our research focus. The query allowed us to identify speeches by key words and retrieve the related data about each speech: date, speaker, speaker party affiliation, speaker country, etc. This allowed us to address the following:
Higher education is increasingly promoted as a policy solution to various other policy problems. The development of EU higher education policies, and the linkages between higher education and other sectors in EU legislative processes, are not necessarily apparent, due to complex governance arrangements between the different EU institutions and the lack of data that would allow a systematic analysis. The Talk of Europe dataset offers an avenue to systematically explore how higher education is proposed in non-higher education agenda topics, considering the role of a number of factors in explaining the prominence of higher education: parliamentary committees, individual members of the EP, different political parties, coalitions, and nation-states. Through a mixed-methods analysis we seek to identify patterns in the relations between higher education and non-higher education agenda topics. This research is relevant not only to the field of higher education but also to understanding the role of the EP in policymaking. (Adapted from the abstract we submitted to Creative Camp #2 for building a SPARQL query.)
Since the camp, we presented our first results at the Consortium of Higher Education Researchers (CHER) in Lisbon, Portugal, in September 2015. There we explored: a) the frequency of speeches on the topic of higher education (HE), b) the frequency of speeches that use an HE term but are not specifically on the topic of HE, and c) how these patterns relate to the party and country affiliations of the speaker. This provided descriptive insights into how HE is discussed in the EP, and the distinct patterns we found let us develop a set of propositions. More on these specific findings once we get this paper submitted to an outlet this fall.
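For readers who want a feel for what such descriptive tallies look like, here is a minimal Python sketch. It is not our actual analysis (that was done in R and SPSS): the field names, the `tally` helper, and the toy records are all made up for illustration and are not drawn from the Talk of Europe data.

```python
from collections import Counter

# Toy records invented for illustration only; not real Talk of Europe data.
# "on_he_topic" marks whether the debate itself was about higher education (HE).
speeches = [
    {"party": "Party A", "country": "NL", "on_he_topic": True},
    {"party": "Party A", "country": "NL", "on_he_topic": False},
    {"party": "Party B", "country": "PT", "on_he_topic": False},
    {"party": "Party B", "country": "SE", "on_he_topic": False},
]


def tally(records, field):
    """Count speeches per value of `field`, split by whether the topic is HE."""
    counts = Counter()
    for rec in records:
        counts[(rec[field], rec["on_he_topic"])] += 1
    return counts


by_party = tally(speeches, "party")
by_country = tally(speeches, "country")

# Print one line per (party, on-topic?) combination.
for (party, on_topic), n in sorted(by_party.items()):
    label = "HE topic" if on_topic else "HE term, other topic"
    print(f"{party} | {label} | {n}")
```

The same helper works for any speaker attribute, which is roughly how one would move from party to country affiliations.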
As the lead data scientist on this project, I thought I would share some of the challenges we encountered, which I hope speak to problems common among social science and policy researchers with limited formal technical training who work with open data and big(ger) data:
a) How do we efficiently code 10,000+ speeches? We took an approach that sought to balance distant and close reading techniques: we manually coded the speech titles and integrated some automatic techniques for identifying key words in the titles, rather than categorizing the text of each speech.
b) Which platform/environment should we use to explore and analyze this data? The shared knowledge within the team of social science and policy researchers made SPSS a first choice for data analysis. Given the limited processing capabilities of that environment, though, I found myself going back and forth between a pipeline of tools that were suitable for the discipline: storing in csv/excel, managing/ordering/coding in R, some basic stats in SPSS & R, and back again.
c) Working with bigger data. I say bigger data because this was a large, structured dump of data, unlike big data, which is largely unstructured or constantly changing. Given this structure and uniformity, we were able to implement methods whose reliability and validity we felt we could evaluate.
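To make the title coding in (a) a bit more concrete, here is a rough Python sketch of flagging key words in debate titles and setting aside everything else for manual coding. It is only an illustration under assumptions: the keyword list, the function names `code_title` and `needs_manual_review`, and the example titles are hypothetical, not the terms or tools we actually used.

```python
import re

# Hypothetical keyword list; the actual terms used in the project are not listed here.
HE_KEYWORDS = ["higher education", "university", "universities", "erasmus"]

# One case-insensitive pattern with word boundaries, so a keyword only
# matches as a whole word or phrase and not inside a longer word.
HE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in HE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def code_title(title):
    """Return the sorted, lowercased keywords found in a debate title."""
    return sorted({m.group(1).lower() for m in HE_PATTERN.finditer(title)})


def needs_manual_review(title):
    """Titles with no keyword hit are left for a human coder."""
    return not code_title(title)


# Made-up titles, for illustration only.
for title in ["Erasmus and higher education mobility", "Fisheries agreement"]:
    print(title, "|", code_title(title), "| manual:", needs_manual_review(title))
```

Because the data is structured and uniform, a rule this simple can be checked against a manually coded sample, which is one way to get at the reliability and validity mentioned in (c).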
PUNCHLINE: Big(ger) data has great potential for policy research, although its size and the technical hurdles in collecting and, in some cases, analyzing it require a set of expertise that is not necessarily in the traditional toolbox of social science and policy researchers = collaborate to innovate.
& Creative Camps are awesome!