Action Search: Spotting Actions in Videos and Its Application to Temporal Action Localization

Humam Alwassel, Fabian Caba Heilbron, and Bernard Ghanem

THURSDAY, SEPTEMBER 8, 2022

Number of Passes: ✅✅✅

Why this paper?

Since the completion of v1 of the WebShop project, I've been thinking more about settings where web agents that study human interaction patterns can help automate or solve practical tasks that, for humans, are tedious or require cognition. For context, WebShop puts forth a shopping task: given a natural language query and a catalog of products, the task worker is asked to find the product they deem the best match for the instructions. We could feasibly apply this framework to tasks in other real, text-rich settings, such as finding recipes (yelp.com) or booking travel plans (expedia.com). The tasks put forth in the World of Bits paper are good inspiration for this direction.

While it'd be an interesting engineering task, from a research perspective I feel like the delta (a.k.a. novelty) of such a project isn't that big. Across such settings, the core problem is essentially identical. The sole change would be a domain shift in the dataset and NL queries, which could be web scraped and crowdsourced following WebShop's data collection procedure. The nature of queries across different domains could make for interesting variations in the information a query contains and, consequently, in the kind of searching behavior exhibited by human task workers. However, my current guess is that such differences would be quite nuanced.

Following this train of thought, I've recently been thinking about tasks in different data modalities, particularly vision. An insight I developed while working on the WebShop project is that for a task to be fertile ground for designing agents that are transferable to real-world settings, and therefore useful for solving practical tasks, the task's dataset and environment must be grounded in a realistic context (this was one of the core inspirations for WebShop; many prior benchmarks for designing grounded language agents were built on synthetic data).

What if we examine traditional vision tasks through this lens? Among the variety of tasks for vision models, temporal localization stood out to me as one that requires a fair amount of human cognition to perform. What's more, unlike visual Q&A or object detection, where a human essentially just names what's in an image or frame, solving temporal localization requires searching a video. I decided to read this paper to better understand the current state of affairs for video localization benchmarks and models, with the overarching question of whether existing localization benchmarks are conducive to designing models that can be transferred to real-world settings. Of the wealth of research on video localization, this paper piqued my interest because the authors' approach to reducing the number of frames to view is derived from how humans might perform the localization task.

Context

The task of temporal localization for videos is defined as follows: given a video and a language query, return a "moment", defined as a [start timestamp, end timestamp] pair. The THUMOS14 and ActivityNet benchmarks are popular for evaluating models on this task. At the time of this paper, prior video localization models were designed to read in a window of frames and output a confidence score for how closely those 2-3 seconds of video align with the requested query. The model is then applied in a brute-force search fashion, scanning windows of frames across the entire video with a stride of 32 to 64 frames. This is not ideal because the model is required to process the entire video, with a large percentage of frames viewed repeatedly across overlapping windows. The authors recognize this, stating:

The large body of work on temporal action localization has mostly focused on improvements in detection performance and/or speed, while very few works have targeted the development of efficient search mechanisms.
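
To make concrete why this is wasteful, here is a minimal sketch of a brute-force sliding-window baseline like the ones described above; `score_window`, the window size, and the stride are stand-in placeholders rather than any specific prior model's settings.

```python
# Hedged sketch of the brute-force sliding-window localization described above.
# `score_window` stands in for a pretrained window classifier; the window size
# and stride loosely mirror the 2-3 second windows and 32-64 frame strides
# mentioned in the text, but are illustrative placeholders.

def brute_force_localize(frames, score_window, window_size=64, stride=32, threshold=0.5):
    """Scan every window of the video and keep those the classifier scores highly."""
    detections = []
    for start in range(0, len(frames) - window_size + 1, stride):
        window = frames[start:start + window_size]
        confidence = score_window(window)  # the model must process this window
        if confidence >= threshold:
            detections.append((start, start + window_size, confidence))
    # Every frame is read at least once, and overlapping windows mean many
    # frames are processed repeatedly, which is the inefficiency the authors target.
    return detections
```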

The intuitive alternative to brute-force search that the authors arrive at is derived from observing humans:

We take notice of how humans approach the problem... [W]e show part of a search sequence a human observer carries out when asked to find the beginning of a Long Jump action in a long video. This sequence reveals that the person can quickly find the spotting target (in 22 search steps) without observing the entire video, which indicates the possible role temporal context plays in searching for actions. In this case, only a very small portion of the video is observed before the search successfully terminates.

This observation forms the basis of the authors' contributions in this paper.

Problem Statement

Goal: Develop a more efficient search mechanism by mimicking how humans jump around a video when performing this task.

The authors are not:
• Proposing a new localization model or improving an existing localization model
• Proposing a variation to the localization task
• Proposing a new benchmark or dataset for evaluating models against the localization task

Accomplishing this goal is divided into three parts:

  1. Gathering trajectories that capture how humans perform the localization task on THUMOS14 and AVA => the Human Searches dataset
  2. Applying the "action spotting" approach to the localization task, with the goal of achieving performance comparable to SOTA architectures while reducing the number of frames viewed => an LSTM (see the sketch after this list) where
    • -> Input: <LSTM Hidden State, Current Frame, Current Timestamp>
    • -> Output: Next Timestamp
  3. Evaluating the efficiency of the action spotting approach by
    • -> Comparing the "search" component: action spotting against rule-based baselines
    • -> Comparing the "localize" component: action spotting + an action classifier against existing localization models
Notes

The justification and execution of the human trajectory collection seem novel and sound to me. I don't have too many questions about the paper's work itself; rather, I find the paper interesting because the discussion of the current state of temporal localization approaches, used to motivate the authors' own contributions, highlights some potential future directions. Following the groundwork the authors laid out, here are some of the things they mentioned that caught my eye.

Length of a Search Trajectory: A recurring theme in the paper is the "number of hops" a human takes to find the approximate area where the desired action takes place. There are a couple of aspects of the collection and evaluation process that got me thinking.

To collect "Human Searches" the dataset, MTurk workers are presented with one of two tasks to be performed on either the AVA or THUMOS dataset.

We investigate two variants of the task: (i) a single class search to find one instance of a given action class and (ii) a multiple class search, which asks Turkers to find one instance from a larger set of action classes... As compared to single class search, we find that Turkers observe 190% and 210% more frames when asked to find an action instance among 10 and 20 action classes, respectively.

However, when the authors train the model on these search trajectories, they only use a subset, as described in the following quotation:

To train our Action Search model, we use the THUMOS14 searches dataset described in Section 3 (discarding the search sequences with less than 8 search steps)

I'm a bit confused by this decision. Why wouldn't such trajectories be included in training? They are ground truth for human behavior on this task, and there's no mention of these trajectories potentially being the result of sloppiness or human error, so why throw them out? Along this same thread of filtering, another decision that stuck out to me was that not all the action classes from the AVA dataset were used when collecting search trajectories. The authors chose to collect trajectories only for action classes that 1. have enough training videos and 2. occur sparsely within their videos.

My guess is that the goal of these decisions was to encourage the trajectories to reflect the "exploration" that humans perform, which in turn would be more useful training data for a model to imitate. An example of the kind of trajectory (I'm guessing) the authors don't want: if a task worker gets "lucky" and finds the goal action on the first hop to a random point in the video, that one-click-and-done trajectory contains little exploratory behavior for a model to pick up on. However, this makes me wonder: with all this curation in place, is the model actually learning to pick the next frame based on the content of past frames? Or is it learning to apply a human-derived search routine per action category? The following quotation describes how model inference is performed when evaluating on the test set. I'm not sure whether it ties into the aforementioned reasoning, but it has me again wondering about what the model is actually learning:

Each search is run for a fixed number of steps... [W]e prefer to launch many short search sequences as opposed to few long ones, since LSTM states tend to saturate and become unstable after a large number of iterations.

Why a fixed number of steps? During inference, the first timestamp the model views is initialized as a random timestamp in the given video. When the query is a simple "find me <action>", if the randomly initialized frame happens to show the correct action, why wouldn't the model just stop then and there? I'm not confident that the above decisions are geared towards imitating human search behavior with a model; some of them feel like workarounds needed to get the model to run.
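
For reference, here's a minimal sketch of the fixed-step inference procedure as I understand it from the quote above; the number of searches and steps per search are assumed placeholders, not the paper's reported settings.

```python
import random

def run_searches(model_step, video_length_s, num_searches=10, steps_per_search=25):
    """Launch many short, independent search sequences instead of one long one.
    `model_step` is a placeholder that maps (state, timestamp) to
    (new state, next timestamp), internally looking at the frame at that time."""
    visited = []
    for _ in range(num_searches):
        t, state = random.uniform(0, video_length_s), None  # random start point
        for _ in range(steps_per_search):                   # fixed budget, no early stop
            state, t = model_step(state, t)
            visited.append(t)
    return visited  # visited timestamps are later scored by an action classifier
```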

I think the Human Searches dataset is a great stepping stone for models that imitate human search for action spotting. However, I question 1. how comprehensive the dataset is when it comes to capturing a large range of human searching behaviors for temporal localization and 2. whether a model trained on such data is actually learning to determine the next timestamp to hop to from previously viewed frames.

Search Strategy: This paper introduces a new consideration for solving the localization problem. I think the quote below is a great tl;dr of the approach they're putting forth:

Thus, one may view the Action Search model as a random sampler with a smart local search: the first search steps are a random sampling of the video (exploration), while the later search steps are fine-grained steps (local search) that rely on the temporal context accumulated throughout the search.

Is the LSTM approach the best? I don't have the answer, and I think that may be because there isn't a good existing benchmark for gauging search approaches. A reduction of frames viewed from 100% to 17% is very impressive, but on a roughly 30-second THUMOS14-style video, that represents about 25 seconds saved. This is relatively significant, but in a world where the average YouTube video is 11 minutes long, it's in long-video settings that the absolute gain of such an exploratory approach would be most significant and useful (see the back-of-the-envelope calculation below).
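
A quick back-of-the-envelope calculation of what the reported 17% viewing rate buys at different video lengths (the 17% figure is from the paper; the video lengths and the assumption that the rate holds for longer videos are mine):

```python
viewing_fraction = 0.17  # fraction of frames observed, per the paper

for label, length_s in [("~30s clip", 30), ("~11min YouTube video", 11 * 60)]:
    skipped = (1 - viewing_fraction) * length_s  # seconds of video never processed
    print(f"{label}: skip ~{skipped:.0f}s of {length_s}s")

# ~30s clip: skip ~25s of 30s
# ~11min YouTube video: skip ~548s of 660s
```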

If I were to hypothesize, I think retaining the efficiency wins that action search has demonstrated on THUMOS14 in a long-video setting would require more than just increasing compute capacity. In a long video, there's much more context, which likely translates to many more hops if a human were to perform the localization task. Does this approach scale to longer trajectories? Let's say we train on a larger dataset of search trajectories across a greater diversity of videos. What happens then?

A synthesis of this sub-section in one sentence: I agree with the premise that a more efficient search mechanism = reducing the number of frames to view = more efficient localization. My belief is that a more efficient search is only truly useful if it works for long videos too, but 1. I'm not confident that the LSTM approach can do this, and 2. I don't think there's an off-the-shelf benchmark at the moment that I could use to verify it.

Questions

• Robustness to video length / number of hops: The LSTM model seems to be trained and run at inference with a very specific number of iterations in mind. What happens if the model is run on a longer video? What if the model is trained not just on trajectories with 8+ search steps, but on all trajectories?
• Is the model actually learning to determine where to hop based on the contents of the video? Or is it learning to mimic human search patterns on a particular task? For instance, if I have an 8-second video where the goal action is performed at the 3-second mark, it should take at most 1-2 hops for the LSTM model to arrive at the solution. However, given that the LSTM is trained exclusively on trajectories with 8+ hops and then evaluated with a fixed number of hops, I'm not sure the model is actually picking up on the core goal of determining which timestamp to pick based on the frames watched so far.
• Can we recast the search approach in another learning framework that generalizes to more videos?

Looking Forward

• Read the THUMOS14 dataset paper. I don't have a great understanding of the nuances of building a video localization task, and I should take a look to better understand what has made THUMOS14 the gold standard for this task over the past several years.
• Read about follow-up work that extends the action spotting search idea. I'm interested in seeing whether there's more exploration into search mechanisms that are more flexible in the data used for training and inference.
• Read about more RL-oriented approaches to video exploration. RL + video localization might be a more scalable and flexible approach, with the "exploration" aspect the authors discuss baked in.