A Chronicle of Discovery in Four Essays
Semester 1: Exploring
What is linguistics? What fields and applications are of interest to you?
As I came into the M.S. program after twelve years away not only from an academic setting but from linguistics, I had some preconceived notions about what my experience would be like and the range of subfields I would touch upon in my journey. As an undergraduate at the University of Florida, I took no Computational Linguistics courses at all (in fact, I would have been hard-pressed to define Computational Linguistics at that point, not having acquired any technical skills in high school), but I was fortunate enough to study under several noted professors (including the Indo-Europeanist D. Gary Miller) who took a rigorous approach to exploring historical data.
Over the intervening years, I developed a strong interest in artificial intelligence (both in the fantastical world of science fiction and in reality), which I did not at the time associate with my much-loved undergraduate studies. Eventually, as personal digital assistants like Siri and Alexa became widespread and their common foibles became the subject of equally widespread jokes, I realized that linguistics, specifically computational linguistics, was the field working to improve these systems and make them communicate in a more “human” manner. Having always wanted to work on artificial intelligence in some capacity, and already having a strong background in linguistics, I felt powerfully compelled to begin the journey of training to be a researcher with the knowledge and experience to address some of these issues.
I didn’t know exactly what to expect from my new studies, but I knew that they would begin to prepare me to think like a researcher. Dr. Zhang’s Research Methods and Corpus Linguistics classes did not disappoint. At this point in my journey, I didn’t yet have the programming foundation that would be a necessary skill set for conducting computational research, but in Research Methods and Corpus Linguistics we were introduced to a wide range of FLA literature as well as a wide range of corpora and tools for processing them. This got me thinking like a graduate student: what is the research question I want to ask? How can I frame it, and design an experiment, in such a way as to answer it thoroughly and satisfactorily?
My initial attempts at study design and execution (the corpus study I conducted this semester examined the Corpus of Contemporary American English (COCA), attempting to answer the question “Has the news really gotten more negative?”) were carefully and thoroughly considered, but of course there were still significant methodological flaws that would require ironing out in a future iteration. This experience initiated me into a core aspect of being a researcher: you are always improving, there will always be future work, and your research will never reach a perfect end state. You will never know everything, and that is the biggest blessing in disguise, because it means that there is always more to learn.
Semester 2: Focusing
Which topics will you investigate in detail now and in your career?
In Dr. Palmer’s Computational Linguistics course, my lack of a programming foundation finally changed. The frequent tutorials and exercises gave us enough background and experience in Python to synthesize that experience and tackle a larger project of our own design. For me, this educational setting supplied some finer points of organization and focus that had previously been missing. I had long wanted to be a “real programmer,” but it was precisely the unfocused nature of this goal, and the seemingly endless self-teaching resources out there, that had left it unfulfilled for many years. How does one choose among these many resources and settle on a course of self-study when an individual goal or motivation isn’t even in focus? Those conditions were all different this time, and I quickly built confidence in my programming skills. It was a beginner’s confidence, of course, the kind that is easy to feel when one has no idea of the jump in difficulty lurking around the corner, but it is an important stage of growth as a developer. I am grateful that I got to learn in such a supportive and well-structured environment.
The end-of-semester project for this course could be considered the first real computational project I undertook, and its successes and areas for improvement were enlightening. A friend, whose primary area of interest is internet language, and I undertook a project that was outwardly playful yet serious in its goal of discovery: an experiment in register transformation. We wrote a program that replaced only the dialogue in our input fiction text of choice with something akin to txtspeak; our system combined dictionary replacement with sentiment analysis (which decided which emoticon, if any, to add to each line of dialogue) and other surface-level text transformations. This system yielded results both hilarious and confusing, and experientially taught us a few important NLP truths. The first of these was that extant NLP systems, even highly developed ones, are incredibly dependent upon their training (or input) data. The input text for our system was Pride and Prejudice, and the system did not perform well on anything that strayed outside this narrow register (Emma and Sense and Sensibility were fine, but anything outside this limited purview was not). This introduced us in a controlled way to the scope and type of challenges in NLP, which in future semesters we would learn to place into context in the field.
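The pipeline can be sketched roughly as follows; the function names, the tiny replacement dictionary, and the word-list “sentiment analysis” below are simplified stand-ins for illustration, not our actual code:

```python
import re

# Toy replacement dictionary (illustrative; a real one would be much larger)
TXTSPEAK = {"you": "u", "are": "r", "to": "2", "for": "4", "be": "b", "great": "gr8"}

# Naive sentiment word lists standing in for a real sentiment analyzer
POSITIVE = {"love", "happy", "delighted", "gr8", "great"}
NEGATIVE = {"hate", "sad", "vexed", "misery"}

def to_txtspeak(dialogue: str) -> str:
    """Replace words via the dictionary and append a sentiment emoticon, if any."""
    words = dialogue.split()
    replaced = [TXTSPEAK.get(w.lower().strip(".,!?;"), w) for w in words]
    score = sum(w.lower().strip(".,!?;") in POSITIVE for w in replaced) - \
            sum(w.lower().strip(".,!?;") in NEGATIVE for w in replaced)
    emoticon = " :)" if score > 0 else (" :(" if score < 0 else "")
    return " ".join(replaced) + emoticon

def transform_text(text: str) -> str:
    """Rewrite only the quoted dialogue, leaving the narration untouched."""
    return re.sub(r'"([^"]*)"', lambda m: '"' + to_txtspeak(m.group(1)) + '"', text)
```

Even this toy version shows why register mattered so much: the dictionary and sentiment lists only cover the vocabulary they were built for, so text outside that vocabulary passes through unchanged.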
Semester 3: Investigating
What challenges exist in planning and implementing your research?
This semester, my third, combined a rapid initiation into graduate-level Computer Science courses with a results- and process-driven focus on data: how to preprocess it, how to approach it with statistical rigor, and how to accurately interpret the results of computational experiments. Whereas in previous semesters I had been introduced to statistical methods for analyzing linguistic data using popular software packages like SPSS, in my Information Science courses I built on both my programming foundation and my statistical one to perform statistical analyses on real-life datasets using many of the data science packages that Python has to offer.
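The kind of analysis this involved can be sketched roughly as follows, using pandas and SciPy on a tiny fabricated dataset (the registers, numbers, and the question being asked are purely illustrative):

```python
import pandas as pd
from scipy import stats

# Illustrative data: sentence lengths sampled from two (invented) registers
df = pd.DataFrame({
    "register": ["news"] * 5 + ["fiction"] * 5,
    "sentence_length": [21, 25, 19, 30, 27, 12, 15, 14, 18, 11],
})

# Summary statistics per group
summary = df.groupby("register")["sentence_length"].agg(["mean", "std", "count"])

# Welch's t-test: does mean sentence length differ between the two registers?
news = df.loc[df["register"] == "news", "sentence_length"]
fiction = df.loc[df["register"] == "fiction", "sentence_length"]
t_stat, p_value = stats.ttest_ind(news, fiction, equal_var=False)
```

The same few lines of pandas and SciPy replace what would have been a point-and-click workflow in SPSS, which is precisely the shift these courses asked of us.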
At this juncture, we were beginning to explore the shape that our research might take, largely through examination of the shared tasks slated to be part of SemEval-2020, a competition in which teams design and build the best-performing machine learning system for one of many emergent problems in computational linguistics. A major benefit of signing onto a shared task, especially as a graduate-level researcher with limited time and experience, is that task-relevant data has already been collected, vetted, and preprocessed, saving possibly hundreds of hours in data collection and curation. The content and scope of the task are also set, the problem having already been identified as important in the field and the specific tasks as effective ways to address it; this allows novice researchers to match their ideas and passions with an initiative that will forward computational discourse as a field, and to work alongside much more experienced individuals and teams. I was immediately drawn toward a task that sought to automatically identify propagandistic discourse where it is used in American news articles, and to label the identified fragments with one of fourteen rhetorical technique labels.
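Labeled fragments of this kind are naturally represented as character offsets into the article text paired with a technique label. A minimal sketch of that representation (the article text, the span offsets, the `PropagandaSpan` class, and the choice of label here are all invented for illustration, though Loaded_Language is one of the fourteen techniques):

```python
from dataclasses import dataclass

@dataclass
class PropagandaSpan:
    """One labeled fragment: character offsets into the article plus a technique label."""
    start: int
    end: int
    technique: str

# Illustrative article and annotation, invented for this sketch
article = "Our opponents want to destroy everything you hold dear."
spans = [PropagandaSpan(start=22, end=54, technique="Loaded_Language")]

# Recover each labeled fragment from its offsets
extracted = [(s.technique, article[s.start:s.end]) for s in spans]
```

A system for the task must both find such spans in unseen articles and assign one of the fourteen labels to each, which is what makes the problem so subtle.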
This task was galvanizing for a multitude of reasons. Among them is that the exercise of institutional power through discourse, especially as it applies to political and religious organizations, has always been a topic of significant personal interest to me. Additionally, the subtlety of the task appealed to me from an artificial intelligence standpoint. Current artificial intelligences are limited in the degree to which they can understand and use speech acts at this level of abstraction, so collective success in this task would be an oblique “win” for computational pragmatics, and therefore for the ability of artificial intelligence to communicate in a more human-like way.
This semester was also a proving ground for the more advanced algorithmic implementation we were asked to learn in Natural Language Processing, which was taught through the Computer Science department. Not only did we get explicit instruction in the primary tasks of NLP, but we were also asked to implement several algorithms from scratch (whereas normally this would be the domain of one of the many methods available in Python-based NLP packages such as spaCy). This was a sink-or-swim learning situation, and thankfully I was able to sharpen my programming skills to a significant degree. This was a critical part of my journey: without experience with tasks of the depth and breadth of those in NLP, I might not have been able to succeed in my research.
Semester 4: Writing
What process did you follow to write up your results?
My research, which might have been spread across an academic year under different circumstances, has been condensed into the past three months for a number of reasons. Participating in SemEval-2020 necessitated a later start: the training and development data for the twelve tasks were released on organizer-determined dates, many of them as recent as November and December. The other practical concern was that building, troubleshooting, and refining two machine learning systems was a development task beyond my ability to complete even a month earlier, since I had not yet completed the advanced practice with data types and control flow that courses like NLP facilitated. The path the M.S. curriculum followed meant that for those of us in the Computational Linguistics track, our last semester would be dedicated both to building our systems and to writing up the results.
Programming for the first part of my system took place from late December through late February, and the system for the second subtask was built from late February through mid-March. While I had grown immensely as a programmer over the previous year, this was still my first foray into building my own machine learning system, meaning that more time than desired was spent debugging and less time than hoped was spent on detailed linguistic feature analysis. It is a rare project that goes perfectly according to plan, however, and despite the debugging setbacks, my results were informative and interesting. This project helped propel me to a place where I knew enough about the field, and had the requisite skills, to begin producing serious research in computational discourse and contributing to the field should I choose to go that route. I feel prepared, as a professional and as someone with experience in thinking like a developer, to solve real issues in both academic and industry applications.
Dr. Kasicheyanula’s Computational Linguistics II course was one of the pillars of support in the writing process. Conducted as a seminar, it had us present three recent ACL papers, critically analyze three more, and enjoy the pleasure of watching our classmates do the same. We were exposed in detail to a varied and recent body of literature in the field for which we were writing, which helped us set the tone of our own pieces and gave us ideas about how to report the results of our systems (which aspects to emphasize, which parts of the results are usually reported visually, etc.). This course also helped us become part of our chosen field by allowing us to contribute to the ongoing discourse within it.
After four semesters, I honestly feel as though I am just beginning, and have finally attained the academic and professional standard that I initially sought. Learning is a lifelong process, and this part of my journey has been an invaluable contribution to this process.