Human Language Understanding & Reasoning
Why this paper?
I am new to the field of NLP. Diving into a new field and getting the lay of the land on current research from scratch is a skill that I feel is important, that I'm currently not great at, and that I want to improve. When faced with a similar situation in the past, I've greatly appreciated high-level, holistic perspectives that tie prior work together with the current state of affairs. I very much enjoyed this read. This article was published in Daedalus, a journal by the American Academy of Arts and Sciences. It is not a traditional research paper; instead, Prof. Christopher Manning presents a brief history of NLP along with his insights into what the future holds for NLP and foundation models. For me, it fits the kind of overview paper that I like, and it got me thinking about where my work might hopefully fall in the context of the bigger picture 🤞.
Due to the nature of this article, I decided to do away with my traditional paper summarization template and write notes on what observations and thoughts popped into my head.
Notes
• Four eras of NLP
The opening section of this article got me thinking about my own "encounters" with the stages of NLP and the transitions from one era to the next.
I used to volunteer at the Computer History Museum, and during a seasonal exhibit on "intelligent systems", the CHM had SHRDLU by Terry Winograd running on a meticulously maintained PDP-6 computer. Recently, a research team at Google released their work on Interactive Language: Talking to Robots in Real Time, which, to me, shares striking similarities with SHRDLU, particularly the main problem of manipulating 3D shapes through dialogue. Fifty years later, the intrigue of the primary task has stood the test of time. Under the hood, however, the approach to this task has changed almost beyond recognition, which I think puts into perspective how drastic the paradigm shift of the past five years has been. While it's mainly an observation, it's amazing how:
Traditional natural language processing models were elaborately composed from several usually independently developed components... In the last few years, companies have started to replace such traditional NLP solutions with LPLMs
My mentor during a past internship had been a PhD candidate at Johns Hopkins University studying NLP in the early 2010s, working on parsing and syntax-based machine translation, building models that put the grammar and linguistic structure of languages to direct use. The discussion of the shift from era three to four made me think of a conversation we had. While by no means unimpressed with how far NLP has come in the past ten years, my mentor was light-heartedly peeved at how the advent of transformers and neural networks had quickly left a body of linguistics-based techniques in the dust. (More thoughts on this in the next bullet point.)
As an undergrad, I took UC Berkeley's Info 159 NLP course, where in 2018 the meatiest portion of class content was still focused on supervised labeling tasks with n-grams and syntax/grammar-based approaches that sought to utilize inherent language structure. While portions of this content remain, the 2020 version of the class is a significant update, ushering in new sections on transformer and encoder/decoder-based models (while still arguably not being up to date; for instance, foundation models are not discussed at all).
• Describing Meaning
The dominant approach to describing meaning... is a denotational semantics approach or a theory of reference: the meaning of a word, phrase, or sentence is the set of objects or situations in the world that it describes
This appears to have been the predominant approach before the transformer architecture and the attention layer heralded the current wave of statistical contextualization that empirical models employ:
This contrasts with the simple distributional semantics (or use theory of meaning) of modern empirical work in NLP, whereby the meaning of a word is simply a description of the contexts in which it appears.
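To make the two notions of meaning concrete for myself, here is a minimal sketch. Everything in it is an assumption for illustration: the toy "world" of objects, the three made-up sentences, and the helper name `context_counts` are mine, not from the article. The denotational view maps a word to the set of things it refers to; the distributional view summarizes a word by the words that appear around it.

```python
from collections import Counter

# Denotational view: a word's meaning is the set of things it refers to.
# This "world" is a hypothetical toy inventory of objects.
denotation = {
    "shehnai": {"shehnai_1"},
    "instrument": {"shehnai_1", "sitar_1", "tabla_1"},
}

# Distributional view: a word's meaning is a summary of the contexts
# it appears in. These sentences are made up for illustration.
corpus = [
    "she played the shehnai at the wedding",
    "he played the sitar at the concert",
    "the shehnai sounded across the courtyard",
]


def context_counts(target, sentences, window=2):
    """Count the words within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


# Denotationally, "every shehnai is an instrument" is a subset check.
print(denotation["shehnai"] <= denotation["instrument"])

# Distributionally, shehnai and sitar look alike because they share contexts.
shared = set(context_counts("shehnai", corpus)) & set(context_counts("sitar", corpus))
print(sorted(shared))
```

Real distributional models learn dense vectors from far larger corpora, but the raw ingredient is the same: which words show up near which.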
The paradigm shift from one to the other seemingly puts them in contrast. However, when taking a step back and thinking about meaning -- done thoughtfully in the following paragraph with the discussion of inferring the meaning of shehnai -- my three high level thoughts were:
- That meaning in different forms can be derived from both perspectives.
- That different kinds of meaning vary in their utility for different tasks.
- That a model can benefit from assimilating both kinds of meaning.
The third point is what I'm most interested in. If a fifth era in NLP awaits us, I wonder if that is the era where both methodologies are incorporated into a model's learning process. One observation is that:
For common traditional NLP tasks... the best current systems are again based on LPLMs, usually finetuned by providing a set of examples labeled in the desired way
But is a traditional supervised task the only way to provide feedback? With the advent of prompting -- provide a natural language description or a few examples, and a foundation model can do very well on tasks it wasn't explicitly trained on -- evidently not! Prompting is all the rage these days, so what does that tell us about what and how foundation models are learning from such fine-tuned instructions? (Note: I'm sure there's work that addresses this; I will need to read up on it.) Perhaps there is value in work that brings back recently abandoned theory-of-reference techniques as a form of feedback that can help models learn meaning that would otherwise be very difficult or laborious to identify in a purely neural, empirical training setting.
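As a note to self on what "provide a natural language description or examples" looks like in practice, here is a minimal sketch of assembling a few-shot prompt. The function name `build_few_shot_prompt`, the "Review:/Sentiment:" layout, and the example reviews are all hypothetical choices of mine; no particular model or API is assumed, and the resulting string would simply be sent to whatever foundation model is at hand.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and an unlabeled query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item is left unlabeled so the model completes it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("I loved every minute of it.", "positive"),
        ("A dull, overlong mess.", "negative"),
    ],
    query="Surprisingly moving and well acted.",
)
print(prompt)
```

No weights are updated here: the "supervision" lives entirely in the text of the prompt, which is what makes it so different from finetuning on a labeled set.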
• Enriching Meaning
The success of LPLMs on language-understanding tasks and the exciting prospects for extending large-scale self-supervised learning to other data modalities suggests exploring a more general direction... [T]he most exciting and promising direction is to build foundation models that also take in other sensory data from the world to enable integrated, multimodal learning
Multimodality has been an active area of research for some time, and it seems a natural next step for LPLMs. The allure of an agent that can navigate multiple mediums is intuitively enticing, but what groundwork is needed to realize these visions?
The CLIP neural network and DALL-E 2 have received a lot of coverage, with some exciting deliverables that have spurred a whole set of subsequent papers developing text-to-<something> generative models. But if we are to design multimodal agents that can act on real tasks, what are the tasks in the first place? CLIP was benchmarked against other vision models on traditional vision tasks. While CLIP's admirable performance speaks to the merit of grounding knowledge across multiple modalities, in terms of evaluation, traditional vision tasks are arguably not a true reflection of a task that requires multimodal data. What I mean is a task that requires information that can be derived from, for instance, a combination of images and text, but not from either source alone. MultiBench is an interesting first attempt at creating multimodal datasets geared towards multimodal tasks, but in the wake of this work, the question remains as to what makes a worthwhile multimodal task. Assimilating information from multiple modalities is one worthwhile research direction, but I'm curious what kind of progress multimodal tasks can tease out with regard to model design and evaluation.
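One way I picture "a task that needs both modalities but neither alone" is a toy XOR setup. This is purely a thought experiment under my own assumptions, not something from the article or from MultiBench: each example has one binary "image" feature and one binary "text" feature, and the label is their XOR. The hypothetical helper `best_accuracy` computes the best any deterministic predictor could do when it only sees a given view of the inputs.

```python
from itertools import product

# Hypothetical toy task: one binary image feature, one binary text feature,
# and a label equal to their XOR.
examples = [((img, txt), img ^ txt) for img, txt in product([0, 1], repeat=2)]


def best_accuracy(view):
    """Best accuracy of any deterministic predictor that only sees view(inputs)."""
    groups = {}
    for inputs, label in examples:
        groups.setdefault(view(inputs), []).append(label)
    # Within each group of indistinguishable inputs, predict the majority label.
    correct = sum(max(labels.count(0), labels.count(1)) for labels in groups.values())
    return correct / len(examples)


image_only = best_accuracy(lambda x: x[0])
text_only = best_accuracy(lambda x: x[1])
both = best_accuracy(lambda x: x)
print(image_only, text_only, both)
```

Either modality alone is stuck at chance, while the pair solves the task perfectly. Real multimodal tasks are obviously nowhere near this clean, but it is the property I would want an evaluation to probe for.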