Trajeglish: Traffic Modeling as Next-Token Prediction (2024)


Jonah Philion^{1,2,3}, Xue Bin Peng^{1,4}, Sanja Fidler^{1,2,3}
^{1}NVIDIA, ^{2}University of Toronto, ^{3}Vector Institute, ^{4}Simon Fraser University
{jphilion, japeng, sfidler}@nvidia.com

Abstract

A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. In pursuit of this functionality, we apply tools from discrete sequence modeling to model how vehicles, pedestrians, and cyclists interact in driving scenarios. Using a simple data-driven tokenization scheme, we discretize trajectories to centimeter-level resolution using a small vocabulary. We then model the multi-agent sequence of discrete motion tokens with a GPT-like encoder-decoder that is autoregressive in time and takes into account intra-timestep interaction between agents. Scenarios sampled from our model exhibit state-of-the-art realism; our model tops the Waymo Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%. We ablate our modeling choices in full autonomy and partial autonomy settings, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes. We additionally evaluate the scalability of our model with respect to parameter count and dataset size, and use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.

1 Introduction

In the short term, self-driving vehicles will be deployed on roadways that are largely populated by human drivers. For these early self-driving vehicles to share the road safely, it is imperative that they become fluent in the ways people interpret and respond to motion. A failure on the part of a self-driving vehicle to predict the intentions of people can lead to overconfident or overly cautious planning. A failure on the part of a self-driving vehicle to communicate to people its own intentions can endanger other road users by surprising them with uncommon maneuvers.

In this work, we propose an autoregressive model of the motion of road users that can be used to simulate how humans might react if a self-driving system were to choose a given sequence of actions. At test time, as visualized in Fig. 1, the model functions as a policy, outputting a categorical distribution over the set of possible states an agent might move to at each timestep. Iteratively sampling actions from the model results in diverse, scene-consistent multi-agent rollouts of arbitrary length. We call our approach Trajeglish (“tra-JEG-lish”) because we model multi-agent trajectories as a sequence of discrete tokens, similar to the representation used in language modeling, and to draw an analogy between how road users use vehicle motion to communicate and how people use verbal languages, like English, to communicate.

A selection of samples from our model is visualized in Fig. 2. When generating these samples, the model is prompted with only the initial position and heading of the agents, in contrast to prior work that generally requires at least one second of historical motion to begin sampling. Our model generates diverse outcomes for each scenario while maintaining the scene-consistency of the trajectories. We encourage readers to consult our project page for videos of scenarios sampled from our model in full control and partial control settings, as well as longer rollouts of 20 seconds.

Our main contributions are:

  • A simple data-driven method for tokenizing trajectory data, which we call “k-disks”, that enables us to tokenize the Waymo Open Motion Dataset (WOMD) (Ettinger et al., 2021) at an expected discretization error of 1 cm using a small vocabulary size of 384.

  • A transformer-based architecture for modeling sequences of motion tokens that conditions on map information and one or more initial states per agent. Our model outputs a distribution over actions for agents one at a time, which we show is well-suited to interactive applications.

  • State-of-the-art quantitative and qualitative results when sampling rollouts given real-world initializations, both when the traffic model controls all agents in the scene and when the model must interact with agents outside its control.

We additionally evaluate the scalability of our model with respect to parameter count and dataset size, visualize the representations learned by our model, and use density estimates from our model to quantify the extent to which intra-timestep dependence exists between agents, as well as to measure the relative importance of long context lengths for traffic modeling (see Sec.4.3).

1.1 Related Work

Our work builds heavily on recent work in imitative traffic modeling. The full family of generative models has been applied to this problem, including VAEs (Suo et al., 2021; Rempe et al., 2021), GANs (Igl et al., 2022), and diffusion models (Zhong et al., 2022; Jiang et al., 2023). While these approaches primarily focus on modeling the multi-agent joint distribution over future trajectories, our focus in this work is additionally on building reactivity into the generative model, for which the factorization provided by autoregression is well-suited. For the structure of our encoder-decoder, we draw inspiration from Scene Transformer (Ngiam et al., 2021), which also uses a global coordinate frame to encode multi-agent interaction, but does not tokenize data and instead trains with a masked regression strategy. A limitation of regression is that it is unclear whether a Gaussian or Laplace mixture distribution is flexible enough to represent the distribution over the next state; with tokenization, we know that all scenarios in WOMD are within the scope of our model, and the only challenge is learning the correct logits. A comparison can also be made to the behavior cloning baselines used in Symphony (Igl et al., 2022) and “Imitation Is Not Enough” (Lu et al., 2023), which also predict a categorical distribution over future states, except that our models are trained directly on pre-tokenized trajectories, and, through the use of the transformer decoder, each embedding receives supervision for predicting the next token as well as all future tokens for all agents in the scene. In terms of tackling the problem of modeling complicated continuous distributions by tokenizing and applying autoregression, our work is most similar to Trajectory Transformer (Janner et al., 2021), which applies a fixed-grid tokenization strategy to model state-action sequences for RL. Finally, our work parallels MotionLM (Seff et al., 2023), concurrent work that also uses discrete sequence modeling for motion prediction, but targets 1- and 2-agent online interaction prediction instead of $N$-agent offline closed-loop simulation.


2 Imitative Traffic Modeling

In this section, we show that the requirement that traffic models interact with all agents at each timestep of simulation, independent of the method used to control each agent, imposes structural constraints on how imitative traffic models factor the multi-agent future trajectory distribution. Similar motivation is used to justify the conditions under which submissions to the WOMD Sim Agents Benchmark are considered valid closed-loop policies (Montali et al., 2023).

We are given an initial scene with $N$ agents, where a scene consists of map information, the dimensions and object class for each of the $N$ agents, and the location and heading of each agent for some number of timesteps in the past. For convenience, we denote the information about the scene provided at initialization by $\bm{c}$. We denote the state of agent $i$ at future timestep $t$ by $\bm{s}_t^i \equiv (x_t^i, y_t^i, h_t^i)$, where $(x, y)$ is the center of the agent's bounding box and $h$ is the heading. For a scenario of length $T$ timesteps, the distribution of interest for traffic modeling is given by

$$p(\bm{s}_1^1, \ldots, \bm{s}_1^N, \bm{s}_2^1, \ldots, \bm{s}_2^N, \ldots, \bm{s}_T^1, \ldots, \bm{s}_T^N \mid \bm{c}). \qquad (1)$$

We refer to samples from this distribution as rollouts. In traffic modeling, our goal is to sample rollouts under the restriction that, at each timestep, a black-box autonomous vehicle (AV) system chooses a state for a subset of the agents. We refer to the agents controlled by the traffic model as “non-player characters” or NPCs. This interaction model imposes the following factorization of the joint likelihood expressed in Eq. 1:

$$p(\bm{s}_1^1, \ldots, \bm{s}_1^N, \bm{s}_2^1, \ldots, \bm{s}_2^N, \ldots, \bm{s}_T^1, \ldots, \bm{s}_T^N \mid \bm{c}) = \prod_{1 \leq t \leq T} p(\bm{s}_t^{1 \ldots N_0} \mid \bm{c}, \bm{s}_{1 \ldots t-1})\, \underbrace{p(\bm{s}_t^{N_0+1 \ldots N} \mid \bm{c}, \bm{s}_{1 \ldots t-1}, \bm{s}_t^{1 \ldots N_0})}_{\text{NPCs}} \qquad (2)$$

where $\bm{s}_{1 \ldots t-1} \equiv \{\bm{s}_1^1, \bm{s}_1^2, \ldots, \bm{s}_{t-1}^N\}$ is the set of all states for all agents prior to timestep $t$, $\bm{s}_t^{1 \ldots N_0} \equiv \{\bm{s}_t^1, \ldots, \bm{s}_t^{N_0}\}$ is the set of states for agents 1 through $N_0$ at time $t$, and we have arbitrarily assigned the agents outside of the traffic model's control the indices $1, \ldots, N_0$. The factorization in Eq. 2 shows that we seek a model from which we can sample an agent's next state conditioned on all states sampled in previous timesteps as well as any states already sampled at the current timestep.

We note that, although the real-world system that generated the driving data involves independent actors, it may still be important to model the influence of actions chosen by other agents at the same timestep, a point we expand on in Appendix A.1. While intra-timestep interaction between agents is weak in general, explicitly modeling this interaction provides a window into understanding cases when it is important to consider for the purposes of traffic modeling.
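The factorization in Eq. 2 corresponds directly to a simulation loop in which the black-box AV system commits its agents' states first at each timestep and the traffic model then fills in the NPCs one agent at a time. A schematic sketch of this loop is below; av_policy, npc_policy, and the argument structure are illustrative stand-ins, not an interface defined by the paper.

```python
def rollout(scene_init, av_policy, npc_policy, av_ids, npc_ids, num_steps):
    """Sample one rollout under the interaction model of Eq. 2 (schematic)."""
    history = []                                  # s_{1...t-1}: states from all previous timesteps
    for t in range(num_steps):
        current = {}                              # states already chosen at timestep t
        # Agents outside the traffic model's control (the black-box AV system) commit first.
        for agent in av_ids:
            current[agent] = av_policy(scene_init, history, agent)
        # NPCs are sampled one at a time, conditioned on the history and on the
        # intra-timestep states already chosen at timestep t.
        for agent in npc_ids:
            current[agent] = npc_policy.sample(scene_init, history, current, agent)
        history.append(current)
    return history
```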

3 Method

In this section, we introduce Trajeglish, an autoregressive generative model of dynamic driving scenarios. Trajeglish consists of two components. The first component is a strategy for discretizing, or “tokenizing” driving scenarios such that we model exactly the conditional distributions required by the factorization of the joint likelihood in Eq.2. The second component is an autoregressive transformer-based architecture for modeling the distribution of tokenized scenarios.

Important features of Trajeglish include that it preserves the factorization of the full likelihood required for dynamic test-time interaction, it accounts for intra-timestep coupling across agents, and it enables both efficient sampling of scenarios and density estimates. While sampling is the primary objective for traffic modeling, we show in Sec. 4.3 that the density estimates from Trajeglish are useful for understanding the importance of longer context lengths and intra-timestep dependence. We introduce our tokenization strategy in Sec. 3.1 and our autoregressive model in Sec. 3.2.


3.1 Tokenization

The goal of tokenization is to model the support of a continuous distribution as a set of $|V|$ discrete options. Given $\bm{x} \in \mathbb{R}^n \sim p(\bm{x})$, a tokenizer is a function $f: \mathbb{R}^n \rightarrow V$ that maps samples from the continuous distribution to one of the discrete options. A renderer is a function $r: V \rightarrow \mathbb{R}^n$ that maps the discrete options back to the raw input space. A high-quality tokenizer-renderer pair is one such that $r(f(\bm{x})) \approx \bm{x}$. The continuous distributions that we seek to tokenize for the case of traffic modeling are given by Eq. 1. We note that these distributions are over single-agent states consisting of only a position and heading. Given the low dimensionality of the input data, we propose a simple approach for tokenizing trajectories based on a fixed set of state-to-state transitions.


Method

Let $\bm{s}_0$ be the state of an agent with length $l$ and width $w$ at the current timestep, and let $\bm{s}$ be the state at the next timestep that we seek to tokenize. We define $V = \{\bm{s}_i\}$ to be a set of template actions, each of which represents a change in position and heading in the coordinate frame of the most recent state. We use the notation $a_i \in \mathbb{N}$ to indicate the index representation of token template $\bm{s}_i$ and $\hat{\bm{s}}$ to represent the raw representation of the tokenized state $\bm{s}$. Our tokenizer $f$ and renderer $r$ are defined by

$$f(\bm{s}_0, \bm{s}) = a_{i^*} = \operatorname*{arg\,min}_i \, d_{l,w}\big(\bm{s}_i, \mathrm{local}(\bm{s}_0, \bm{s})\big) \qquad (3)$$
$$r(\bm{s}_0, a_i) = \hat{\bm{s}} = \mathrm{global}(\bm{s}_0, \bm{s}_i) \qquad (4)$$

where $d_{l,w}(\bm{s}_0, \bm{s}_1)$ is the average of the L2 distances between the ordered corners of the bounding boxes defined by $\bm{s}_0$ and $\bm{s}_1$, “local” converts $\bm{s}$ to the local frame of $\bm{s}_0$, and “global” converts $\bm{s}_{i^*}$ from the local frame of $\bm{s}_0$ back to the global frame. We use $d_{l,w}(\cdot,\cdot)$ throughout the rest of the paper to refer to this mean corner distance metric. Importantly, to tokenize a full trajectory, this process of converting states $\bm{s}$ to their tokenized counterparts $\hat{\bm{s}}$ is applied iteratively along the trajectory, using tokenized states as the base state $\bm{s}_0$ in the next tokenization step. We visualize the procedure for tokenizing a trajectory in Fig. 3. Tokens generated with our approach have three convenient properties for the purposes of traffic modeling: they are invariant across coordinate frames, they are invariant under temporal shift, and they supply efficient access to a measure of similarity between tokens, namely the distance between their raw representations. We discuss how to exploit the third property for data augmentation in Sec. A.2.
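To make Eqs. 3 and 4 concrete, the following is a minimal NumPy sketch of the tokenizer, renderer, and iterative trajectory tokenization described above. The helper names and the (x, y, h) state layout are our own assumptions rather than code from a released implementation.

```python
import numpy as np

def to_local(s0, s):
    """Express state s = (x, y, h) in the frame of s0."""
    dx, dy = s[:2] - s0[:2]
    c, s_ = np.cos(-s0[2]), np.sin(-s0[2])
    return np.array([c * dx - s_ * dy, s_ * dx + c * dy, s[2] - s0[2]])

def to_global(s0, action):
    """Map a local-frame action (dx, dy, dh) back to a global state."""
    c, s_ = np.cos(s0[2]), np.sin(s0[2])
    dx, dy, dh = action
    return np.array([s0[0] + c * dx - s_ * dy, s0[1] + s_ * dx + c * dy, s0[2] + dh])

def corner_distance(a, b, length, width):
    """Mean L2 distance between the ordered bounding-box corners of two states (d_{l,w})."""
    def corners(s):
        c, s_ = np.cos(s[2]), np.sin(s[2])
        R = np.array([[c, -s_], [s_, c]])
        offsets = np.array([[ length / 2,  width / 2], [ length / 2, -width / 2],
                            [-length / 2, -width / 2], [-length / 2,  width / 2]])
        return s[:2] + offsets @ R.T
    return np.linalg.norm(corners(a) - corners(b), axis=-1).mean()

def tokenize_trajectory(s0, states, templates, length, width):
    """Greedily tokenize a trajectory, rolling the tokenized state forward (Eqs. 3-4)."""
    base, tokens = s0, []
    for s in states:
        local = to_local(base, s)
        i_star = int(np.argmin([corner_distance(t, local, length, width) for t in templates]))
        tokens.append(i_star)
        base = to_global(base, templates[i_star])   # render the token and continue from it
    return tokens
```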

Optimizing template sets

We propose an easily parallelizable approach for finding template sets with low discretization error: we collect a large number of state-to-state transitions observed in data, sample one of them, filter out all transitions within $\epsilon$ meters of the sample, and repeat $|V|$ times. Pseudocode for this algorithm is included in Alg. 1. We call this method for sampling candidate templates “k-disks” given its similarity to k-means++, the standard algorithm for seeding the anchors of k-means (Arthur & Vassilvitskii, 2007), as well as to Poisson disk sampling (Cook, 1986). We visualize the template sets found using k-disks with minimum discretization error in Fig. 4. We verify in Fig. 5 that the tokenized action distribution is similar on the WOMD train and validation splits despite the fact that the templates are optimized on the training set. We show in Fig. 6 that the discretization error induced by templates sampled with k-disks is in general much lower than that of k-means, across agent types. A comprehensive evaluation of k-disks in comparison to baselines is in Sec. A.3.


3.2 Modeling

The second component of our method is an architecture for learning a distribution over the sequences of tokens output by the first. Our model follows an encoder-decoder structure very similar to those used for LLMs (Vaswani et al., 2017; Radford et al., 2019; Raffel et al., 2019). A diagram of the model is shown in Fig. 7. Two important properties of our encoder are that it is not equivariant to the choice of global coordinate frame and that it is not permutation equivariant to the agent order. For the first property, randomizing the choice of coordinate frame during training is straightforward, and sharing a global coordinate frame enables shared processing and representation learning across agents. For the second property, permutation equivariance is not actually desirable in our case since the agent order encodes the order in which agents select actions within a timestep; the ability of our model to predict actions should improve when the already-chosen actions of other agents are provided.

Encoder

Our model takes as input two modalities that encode the initial scene. The first is the initial state of the agents in the scene, which includes the length, width, initial position, initial heading, and object class. We apply a single-layer MLP to encode these values per-agent to an embedding of size $C$. We then add a positional embedding that encodes the agent's order as well as agent identity across the action sequence. The second modality is the map. We use the WOMD representation of a map as a collection of “map objects”, where a map object is a variable-length polyline representing, for example, a lane, a sidewalk, or a crosswalk. We apply a VectorNet encoder to encode the map to a sequence of embeddings for at most $M$ map objects (Gao et al., 2020). Note that although the model is not permutation equivariant to the agents, it is permutation invariant to the ordering of the map objects. Similar to Wayformer (Nayakanti et al., 2022), we then apply a layer of latent query attention that outputs a final encoding of the scene initialization.

Decoder

Given the set of multi-agent future trajectories, we tokenize the trajectories and flatten them using the same order used to apply positional embeddings to the $t=0$ agent encoder to get a sequence $a_0^0 a_1^0 \ldots a_N^T$. We then prepend a start token, pop the last token, and use an embedding table to encode the result. For timesteps at which an agent's state was not observed in the data, we set the embedding to zeros. We pass the full sequence through a transformer with a causal mask during training. Finally, we use a linear layer to decode a distribution over the $|V|$ template states and train to maximize the probability of the next token with a cross-entropy loss. We tie the token embedding matrix to the weight of the final linear layer, which we observed results in small improvements (Press & Wolf, 2017). We leverage flash attention (Dao et al., 2022), which we find greatly speeds up training, as documented in Sec. A.8.
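A minimal PyTorch-style sketch of the decoder described above is shown below. The module names, mask handling, and treatment of unobserved timesteps are illustrative assumptions; only the overall structure (start token, shifted inputs, causal attention over flattened motion tokens, tied output weights, cross-entropy on the next token) follows the description in this section.

```python
import torch
import torch.nn as nn

class MotionTokenDecoder(nn.Module):
    """Causal decoder over flattened motion tokens (timestep-major, agents within a timestep)."""
    def __init__(self, vocab_size=384, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.start_id = vocab_size                      # extra id reserved for the start token
        self.token_emb = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size + 1, bias=False)
        self.head.weight = self.token_emb.weight        # tie output weights to the embedding table

    def forward(self, tokens, scene_memory, observed_mask):
        # tokens: (B, L) token ids; scene_memory: (B, M, d_model) scene-initialization encoding;
        # observed_mask: (B, L) bool, False where the corresponding input state was unobserved.
        B, L = tokens.shape
        # Prepend the start token and drop the last token so position k predicts token k.
        inputs = torch.cat([tokens.new_full((B, 1), self.start_id), tokens[:, :-1]], dim=1)
        x = self.token_emb(inputs)
        x = x * observed_mask.unsqueeze(-1).to(x.dtype)  # zero embeddings for unobserved steps
        causal = nn.Transformer.generate_square_subsequent_mask(L).to(x.device)
        h = self.decoder(x, scene_memory, tgt_mask=causal)
        return self.head(h)                              # (B, L, vocab_size + 1) next-token logits

# Training objective (sketch): cross-entropy between the logits and the un-shifted tokens, e.g.
# loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), tokens.flatten())
```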

We highlight that although the model is trained to predict the next token, it is not the case that the embedding for a given agent's motion token receives supervision only for the task of predicting the next token. Since the embeddings for later tokens attend to the embeddings of earlier tokens, the embedding at a given timestep receives signal for the task of predicting all future tokens across all agents.

4 Experiments

We use the Waymo Open Motion Dataset (WOMD) to evaluate Trajeglish in full and partial control environments. We report results for rollouts produced by Trajeglish on the official WOMD Sim Agents Benchmark in Sec.4.1. We then ablate our design choices in simplified full and partial control settings in Sec.4.2. Finally, we analyze the representations learned by our model and the density estimates it provides in Sec.4.3. The hyperparameters for each of the models that we train can be found in Sec.A.4.

Table 1:
Method | Realism meta metric ↑ | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑ | minADE (m) ↓
Constant Velocity | 0.2380 | 0.0465 | 0.3372 | 0.3680 | 7.924
Wayformer (Identical) | 0.4250 | 0.3120 | 0.4482 | 0.5620 | 2.498
MTR+++ | 0.4697 | 0.3597 | 0.4929 | 0.6028 | 1.682
Wayformer (Diverse) | 0.4720 | 0.3613 | 0.4935 | 0.6077 | 1.694
Joint-Multipath++ | 0.4888 | 0.4073 | 0.4991 | 0.6018 | 2.052
MTR_E* | 0.4911 | 0.4180 | 0.4905 | 0.6073 | 1.656
MVTA | 0.5091 | 0.4175 | 0.5186 | 0.6374 | 1.870
MVTE* | 0.5168 | 0.4202 | 0.5289 | 0.6486 | 1.677
Trajeglish | 0.5339 | 0.4019 | 0.5811 | 0.6667 | 1.872

4.1 WOMD Sim Agents Benchmark

We test the sampling performance of our model on the WOMD Sim Agents Benchmark and report results in Tab. 1. Submissions to this benchmark are required to include 32 rollouts of length 8 seconds at 10 Hz per scenario, each of which contains up to 128 agents. We bold multiple submissions if they are within 1% of each other, as in Montali et al. (2023). Trajeglish is the top submission along the leaderboard meta metric, outperforming several well-established motion prediction models, including Wayformer, MultiPath++, and MTR (Shi et al., 2022; 2023), while being the first submission to use discrete sequence modeling. Most of the improvement is due to the fact that Trajeglish models interaction between agents significantly better than prior work, improving the state of the art along the interaction metrics by 9.9%. A full description of how we sample from the model for this benchmark, with comparisons on the WOMD validation set, is included in Appendix A.5.


4.2 Ablation

To simplify our ablation study, we test models in this section on scenarios of the kind they are trained on: at most 24 agents and at most 6.4 seconds in length. We compare performance across five variants of our model. Both “trajeglish” and “trajeglish w/ reg.” refer to our model, the latter using the noisy tokenization strategy discussed in Sec. A.2. The “no intra” model is an important baseline designed to mimic the behavior cloning baselines used in Symphony (Igl et al., 2022) and “Imitation Is Not Enough” (Lu et al., 2023). For this baseline, we keep the same architecture but adjust the masking strategy in the decoder so that actions already chosen for the current timestep are not attended to. The “marginal” baseline is designed to mimic the behavior of models such as Wayformer (Nayakanti et al., 2022) and MultiPath++ (Varadarajan et al., 2021) that are trained to model the distribution over single-agent trajectories instead of multi-agent scene-consistent trajectories. For this baseline, we keep the same architecture but apply a mask to the decoder that enforces that the model can only attend to previous actions chosen by the current agent. Our final baseline is the same as the marginal baseline but without a map encoder; we use it to understand the extent to which the models rely on the map for traffic modeling.
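The sketch below illustrates how the conditioning patterns of the three masking variants differ over the flattened token sequence, assuming tokens are ordered timestep-major and agent-minor within a timestep; the function is our illustration of the ablation, not code from the paper.

```python
import numpy as np

def build_ablation_mask(num_agents, num_steps, variant):
    """Boolean conditioning mask over the flattened token sequence.

    Tokens are ordered (t=0: agents 1..N), (t=1: agents 1..N), ...
    mask[q, k] = True means the prediction for token q may condition on token k.
    """
    L = num_agents * num_steps
    t = np.arange(L) // num_agents        # timestep of each token
    a = np.arange(L) % num_agents         # agent index of each token
    tq, tk = t[:, None], t[None, :]
    aq, ak = a[:, None], a[None, :]
    if variant == "trajeglish":           # all earlier tokens, incl. same-timestep earlier agents
        return (tk < tq) | ((tk == tq) & (ak < aq))
    if variant == "no_intra":             # no intra-timestep conditioning: strictly earlier timesteps
        return tk < tq
    if variant == "marginal":             # single-agent: only the same agent's earlier tokens
        return (ak == aq) & (tk < tq)
    raise ValueError(variant)
```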

Partial control

We report results in Fig. 8 for a partial controllability setting in which a single agent in each scenario is chosen to be controlled by the traffic model and all other agents are set to replay. The single-agent ADE (average displacement error) for the controlled agent is similar in full autonomy rollouts for all models other than the model that does not condition on the map, as expected. However, in rollouts where all other agents are placed on replay, the replay trajectories leak information about the trajectory that the controlled agent took in the data, and as a result, the no-intra and Trajeglish rollouts have a lower ADE. Additionally, the Trajeglish rollouts in which the controlled agent is placed first in the order do not condition on intra-timestep information and therefore behave identically to the no-intra baseline, whereas rollouts in which the controlled agent is placed last provide the model with more information about the replay trajectories and result in a decreased ADE.

Full control

We evaluate the collision rate of models under full control in Fig. 9 as a function of initial context, object category, and rollout duration. The value of modeling intra-timestep interaction is most obvious when only a single timestep is used to seed generation, although intra-timestep modeling significantly improves the collision rate in all cases for vehicles. For interaction between pedestrians, Trajeglish captures the grouping behavior effectively. We observe that noising the tokens during training improves rollout performance slightly in the full control setting. We expect these rates to improve quickly given more training data, as suggested by the scaling study in Sec. 4.2.


Figure (Scaling Behavior): Our preliminary study on parameter and dataset scaling suggests that, compared to LLMs (Kaplan et al., 2020), Trajeglish is severely data-constrained on WOMD; models with 35M parameters just start to be significantly better than models with 15M parameters for datasets the size of WOMD. A more rigorous study of how all hyperparameters of the training strategy affect sampling performance is reserved for future work.

Figure (nuScenes transfer): We test the ability of our model to transfer to the maps and scenario initializations in the nuScenes dataset. The differences between the maps and behaviors found in nuScenes and those in WOMD are such that LoRA does not provide enough expressiveness to fine-tune the model to peak performance. The fine-tuned models both outperform and train faster than a model trained exclusively on nuScenes.

4.3 Analysis

Intra-Timestep Dependence

To understand the extent to which our model leverages intra-timestep dependence, in Fig. 10 we evaluate the negative log likelihood under our model of an agent's next action as a function of the agent's position in the selected permutation and the amount of historical context the model is provided. In all cases, the agent gains predictive power from conditioning on the actions selected by other agents within the same timestep, but the log likelihood levels out as more historical context is provided. Intra-timestep dependence is significantly less important when more than 4 timesteps of history are provided, as is the setting used for most motion prediction benchmarks.

Representation Transferability

We measure the generalization of our model to the nuScenes dataset (Caesar et al., 2019). As recorded in Sec. A.8, nuScenes is 3 orders of magnitude smaller than WOMD. Additionally, nuScenes includes scenes from Singapore, where the lane convention is opposite that of North America, where WOMD was collected. Nevertheless, we show in the nuScenes transfer figure in Sec. 4.2 that our model can be fine-tuned to a validation NLL far lower than that of a model trained from scratch on only the nuScenes dataset. At the same time, we find that LoRA (Hu et al., 2021) does not provide enough expressiveness to achieve the same NLL as fine-tuning the full model. While bounding boxes have a fairly canonical definition, we note that there are multiple arbitrary choices in the definition of map objects that may inhibit the transfer of traffic models to different datasets.

Token Embeddings

We visualize the embeddings that the model learns in Fig.11. Through the task of predicting the next token, the model learns a similarity matrix across tokens that reflects the Euclidean distance between the raw actions that the tokens represent.

Preliminary Scaling Law

We perform a preliminary study of how our model scales with increased parameter count and dataset size in the scaling figure in Sec. 4.2. We find that performance between a model with 15.4M parameters and one with 35.6M parameters is equivalent up to 0.5B tokens, suggesting that a large performance gain is expected if the dataset size can be expanded beyond the 1B tokens in WOMD. We reserve more extensive studies of model scaling for future work.


5 Conclusion

In this work, we introduce a discrete autoregressive model of the interaction between road users. By improving the realism of self-driving simulators, we hope to enhance the safety of self-driving systems as they are increasingly deployed into the real world.

References

  • Arthur & Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pp. 1027–1035, USA, 2007. Society for Industrial and Applied Mathematics. ISBN 9780898716245.
  • Caesar et al. (2019) Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. CoRR, abs/1903.11027, 2019. URL http://arxiv.org/abs/1903.11027.
  • Cook (1986) Robert L. Cook. Stochastic sampling in computer graphics. ACM Trans. Graph., 5(1):51–72, January 1986. ISSN 0730-0301. doi: 10.1145/7529.8927. URL https://doi.org/10.1145/7529.8927.
  • Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022.
  • Ettinger et al. (2021) Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R. Qi, Yin Zhou, Zoey Yang, Aurélien Chouard, Pei Sun, Jiquan Ngiam, Vijay Vasudevan, Alexander McCauley, Jonathon Shlens, and Dragomir Anguelov. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9710–9719, October 2021.
  • Gao et al. (2020) Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Congcong Li, and Cordelia Schmid. VectorNet: Encoding HD maps and agent dynamics from vectorized representation, 2020.
  • Holtzman et al. (2020) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygGQyrFvH.
  • Hu et al. (2023) Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving, 2023.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021. URL https://arxiv.org/abs/2106.09685.
  • Igl et al. (2022) Maximilian Igl, Daewoo Kim, Alex Kuefler, Paul Mougin, Punit Shah, Kyriacos Shiarlis, Dragomir Anguelov, Mark Palatucci, Brandyn White, and Shimon Whiteson. Symphony: Learning realistic and diverse agents for autonomous driving simulation, 2022.
  • Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, 2021.
  • Jiang et al. (2023) Chiyu Max Jiang, Andre Cornman, Cheolho Park, Ben Sapp, Yin Zhou, and Dragomir Anguelov. MotionDiffuser: Controllable multi-agent motion prediction using diffusion, 2023.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  • Lu et al. (2023) Yiren Lu, Justin Fu, George Tucker, Xinlei Pan, Eli Bronstein, Rebecca Roelofs, Benjamin Sapp, Brandyn White, Aleksandra Faust, Shimon Whiteson, Dragomir Anguelov, and Sergey Levine. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios, 2023.
  • Montali et al. (2023) Nico Montali, John Lambert, Paul Mougin, Alex Kuefler, Nick Rhinehart, Michelle Li, Cole Gulino, Tristan Emrich, Zoey Yang, Shimon Whiteson, Brandyn White, and Dragomir Anguelov. The Waymo open sim agents challenge, 2023.
  • Nayakanti et al. (2022) Nigamaa Nayakanti, Rami Al-Rfou, Aurick Zhou, Kratarth Goel, Khaled S. Refaat, and Benjamin Sapp. Wayformer: Motion forecasting via simple and efficient attention networks, 2022.
  • Ngiam et al. (2021) Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, David Weiss, Benjamin Sapp, Zhifeng Chen, and Jonathon Shlens. Scene Transformer: A unified multi-task model for behavior prediction and planning. CoRR, abs/2106.08417, 2021. URL https://arxiv.org/abs/2106.08417.
  • Philion (2019) Jonah Philion. FastDraw: Addressing the long tail of lane detection by adapting a sequential prediction network. CoRR, abs/1905.04354, 2019. URL http://arxiv.org/abs/1905.04354.
  • Press & Wolf (2017) Ofir Press and Lior Wolf. Using the output embedding to improve language models, 2017.
  • Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR, abs/1910.10683, 2019. URL http://arxiv.org/abs/1910.10683.
  • Ranzato et al. (2016) Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06732.
  • Rempe et al. (2021) Davis Rempe, Jonah Philion, Leonidas J. Guibas, Sanja Fidler, and Or Litany. Generating useful accident-prone driving scenarios via a learned traffic prior. CoRR, abs/2112.05077, 2021. URL https://arxiv.org/abs/2112.05077.
  • Ross & Bagnell (2010) Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL https://proceedings.mlr.press/v9/ross10a.html.
  • Seff et al. (2023) Ari Seff, Brian Cera, Dian Chen, Mason Ng, Aurick Zhou, Nigamaa Nayakanti, Khaled S. Refaat, Rami Al-Rfou, and Benjamin Sapp. MotionLM: Multi-agent motion forecasting as language modeling, 2023.
  • Shi et al. (2022) Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. Motion Transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 2022.
  • Shi et al. (2023) Shaoshuai Shi, Li Jiang, Dengxin Dai, and Bernt Schiele. MTR++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying. arXiv preprint arXiv:2306.17770, 2023.
  • Suo et al. (2021) Simon Suo, Sebastian Regalado, Sergio Casas, and Raquel Urtasun. TrafficSim: Learning to simulate realistic multi-agent behaviors, 2021.
  • van den Oord et al. (2016) Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016. URL http://arxiv.org/abs/1609.03499.
  • Varadarajan et al. (2021) Balakrishnan Varadarajan, Ahmed Hefny, Avikalp Srivastava, Khaled S. Refaat, Nigamaa Nayakanti, Andre Cornman, Kan Chen, Bertrand Douillard, Chi-Pang Lam, Dragomir Anguelov, and Benjamin Sapp. MultiPath++: Efficient information fusion and trajectory aggregation for behavior prediction. CoRR, abs/2111.14973, 2021. URL https://arxiv.org/abs/2111.14973.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762.
  • Zhong et al. (2022) Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, and Marco Pavone. Guided conditional diffusion for controllable traffic simulation, 2022.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR, abs/1506.06724, 2015. URL http://arxiv.org/abs/1506.06724.

Appendix A Appendix

A.1 Intra-Timestep Interaction

There are a variety of reasons that intra-timestep dependence may exist in driving log data. To list a few: driving logs are recorded at discrete timesteps, and any interaction in the real world between timesteps gives the appearance of coordinated behavior in log data. Additionally, information that is not generally recorded in log data, such as eye contact or turn signals, may lead to intra-timestep dependence. Finally, the fact that log data exists in 10-20 second chunks can result in intra-timestep dependence if there were events before the start of the log data that result in coordination during the recorded scenario. These factors are in general weak, but may give rise to behavior in rare cases that is not possible to model without taking into account coordination across agents within a single timestep.

A.2 Regularization

Trajeglish is trained with teacher forcing, meaning that it is trained on the tokenized representation of ground-truth trajectories. However, at test time, the model ingests its own actions. Given that the model does not model the ground-truth distribution perfectly, there is an inevitable mismatch between the training and test distributions that can lead to compounding errors (Ross & Bagnell, 2010; Ranzato et al., 2016; Philion, 2019). We combat this effect by noising the tokens fed as input to the model. More concretely, when tokenizing the input trajectories, instead of choosing the token with minimum corner distance to the ground-truth state as stated in Eq. 3, we sample the token from the distribution

$$a_i \sim \mathrm{softmax}_i\big(\mathrm{nucleus}\big(-d(\bm{s}_i, \bm{s})/\sigma,\; p_{\mathrm{top}}\big)\big) \qquad (5)$$

meaning we treat the distances between the ground-truth raw state and the templates, scaled by a temperature $\sigma$, as the logits of a categorical distribution (with closer templates receiving higher probability) and apply nucleus sampling (Holtzman et al., 2020) to generate sequences of motion tokens. When $\sigma \rightarrow 0$ and $p_{\mathrm{top}} = 1$, the approach recovers the tokenization strategy defined in Eq. 3. Intuitively, if two tokens are equidistant from the ground-truth under the average corner distance metric, this approach will sample one of the two tokens with equal probability during training. Note that we retain the minimum-distance template index as the ground-truth target even when noising the input sequence.
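A minimal NumPy sketch of this noisy tokenization is below, written with the convention that closer templates receive higher probability (i.e., the negated distances act as logits); the default σ and p_top values are those reported in Sec. A.4.

```python
import numpy as np

def noisy_tokenize(distances, sigma=0.008, p_top=0.95, rng=None):
    """Sample a training-time input token given distances d(s_i, s) to each template (Eq. 5).

    distances: (|V|,) mean corner distances between the ground-truth state and each template.
    Returns (input_token, target_token); the target is always the minimum-distance template.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = -distances / sigma                       # closer templates get higher probability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus (top-p) filtering: keep the smallest set of templates with cumulative mass >= p_top.
    order = np.argsort(-probs)
    keep = np.cumsum(probs[order]) - probs[order] < p_top
    nucleus = order[keep]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    input_token = int(rng.choice(nucleus, p=nucleus_probs))
    target_token = int(np.argmin(distances))
    return input_token, target_token
```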

While this method of regularization does make the model more robust to errors in its samples at test time, it also adds noise to the observation of the states of other agents which can make the model less responsive to the motion of other agents at test time. As a result, we find that this approach primarily improves performance for the setting where all agents are controlled by the traffic model.


A.3 Tokenization Analysis

We compare our approach for tokenization against two grid-based tokenizers (van den Oord et al., 2016; Seff et al., 2023; Janner et al., 2021) and one sampling-based tokenizer. The details of these methods are below.

$(x,y,h)$-grid - We independently discretize the change in longitudinal position, lateral position, and heading, and treat the template set as the product of these three sets. For vocabulary sizes of 128/256/384/512 respectively, we use 6/7/8/9 values for $x$ and $y$, and 4/6/7/8 values for $h$. These values are spaced evenly between (-0.3 m, 3.5 m) for $x$, (-0.2 m, 0.2 m) for $y$, and (-0.1 rad, 0.1 rad) for $h$.

$(x,y)$-grid - We independently discretize the change in location only. We choose the heading for each template based on the heading of the state-to-state transition found in the data whose change in location is closest to the template location. Compared to the $(x,y,h)$-grid baseline, this approach assumes heading is deterministic given location in order to gain resolution in location. We use 12/16/20/23 values for $x$ and $y$ with the same bounds as in the $(x,y,h)$-grid baseline.

k-means - We run k-means many times on a dataset of $(x,y,h)$ state-to-state transitions. The distance metric is the distance between the $(x,y)$ locations. We note that the main source of randomness across runs is how k-means is seeded, for which we use k-means++ (Arthur & Vassilvitskii, 2007). We ultimately select the template set with minimum expected discretization error as measured by the average corner distance.

k-disks - As shown in Alg. 1, we sample subsets of a dataset of state-to-state transitions that are at least $\epsilon$ from each other. For vocabulary sizes of 128/256/384/512, we use $\epsilon$ of 3.5/3.5/3.5/3.0 centimeters.

Intuitively, the issue with both grid-based methods is that they distribute templates evenly instead of focusing them in regions of the support where the most state transitions occur. The main issue with k-means is that the heading is not taken into account when optimizing the cluster centers.

We offer several comparisons between these methods. In Fig. 12, we plot the expected corner distance between trajectories and tokenized trajectories as a function of trajectory length for the template sets found with k-disks. In Fig. 13, we compare the tokenization error as a function of trajectory length and find that grid-based tokenizers create large oscillations. To calibrate to a metric more relevant to the traffic modeling task, we compare in Fig. 14 the collision rate as a function of trajectory length for the raw scenarios and for scenarios tokenized with k-disk template sets of size 128, 256, 384, and 512. We observe that a vocabulary size of 384 is sufficient to avoid creating extraneous collisions. Finally, Fig. 15 plots the full distribution of discretization errors for each of the baselines, and Tab. 2 reports the expected discretization error across vocabulary sizes for each of the methods.

Algorithm 1:
1: procedure SampleKDisks(X, N, ε)
2:     S ← {}
3:     while len(S) < N do
4:         x₀ ∼ X
5:         X ← {x ∈ X | d(x₀, x) > ε}
6:         S ← S ∪ {x₀}
7:     return S
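For reference, a minimal NumPy version of Alg. 1 might look as follows; the array layout and the corner_distance callable are assumptions on our part. In practice, the procedure can be run many times in parallel and the candidate template set with the lowest expected discretization error kept.

```python
import numpy as np

def sample_k_disks(transitions, vocab_size, eps, corner_distance, rng=None):
    """Sample one candidate template set via the k-disks procedure (Alg. 1).

    transitions:     (M, 3) array of (dx, dy, dh) state-to-state transitions.
    vocab_size:      number of templates |V| to select.
    eps:             minimum allowed distance (in meters) between selected templates.
    corner_distance: callable mapping (template, transitions) -> (M,) distances,
                     e.g. the mean corner distance d_{l,w} for a nominal box size.
    """
    rng = np.random.default_rng() if rng is None else rng
    remaining = transitions.copy()
    templates = []
    while len(templates) < vocab_size and len(remaining) > 0:
        # Sample one observed transition to become a template.
        idx = rng.integers(len(remaining))
        template = remaining[idx]
        templates.append(template)
        # Filter out all transitions within eps of the chosen template.
        keep = corner_distance(template, remaining) > eps
        remaining = remaining[keep]
    return np.stack(templates)
```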

Table 2: Expected discretization error $\mathbb{E}[d(s, \hat{s})]$ (cm)
method | |V|=128 | |V|=256 | |V|=384 | |V|=512
(x,y,h)-grid | 20.50 | 16.84 | 14.09 | 12.59
(x,y)-grid | 9.35 | 8.71 | 5.93 | 4.74
k-means | 14.49 | 8.17 | 6.13 | 5.65
k-disks | 2.66 | 1.46 | 1.18 | 1.02

A.4 Training hyperparameters

We train two variants of our model. The variant we use for the WOMD benchmark is trained on scenarios with up to 24 agents within 60.0 meters of the origin, up to 96 map objects with map points within 100.0 meters of the origin, 2 map encoder layers, 2 transformer encoder layers, 6 transformer decoder layers, a hidden dimension of 512, trained to predict 32 future timesteps for all agents. We train with a batch size of 96, with a tokenization temperature of 0.008, a tokenization nucleus of 0.95, a top learning rate of 5e-4 with 500 step warmup and linear decay over 800k optimization steps with AdamW optimizer (Loshchilov & Hutter, 2017). We use the k-disks tokenizer with vocabulary size 384. During training, we choose a random 4-second subsequence of a WOMD scenario, a random agent state to define the coordinate frame, and a random order in which the agents are fed to the model.

For the models we analyze in all other sections, we use the same setting from above, but train to predict 64 timesteps, using only 700k optimization steps. Training on these longer scenarios enables us to evaluate longer rollouts without the complexity of extending rollouts in a fair way across models, which we do only for the WOMD Sim Agents Benchmark and document in Sec.A.5.

A.5 Extended Rollouts for WOMD Sim Agents Benchmark

In order to sample scenarios for evaluation on the WOMD sim agents benchmark, we require the ability to sample scenarios with an arbitrary number of agents arbitrarily far from each other for an arbitrary number of future timesteps. While it may become possible to train a high-performing model on 128-agent scenarios on larger datasets, we found that training our model on 24-agent scenarios and then sampling from the model using a “sliding window” (Hu etal., 2023) both spatially and temporally achieved top performance.

In detail, at a given timestep during sampling, we determine the 24-agent subsets with the following approach. First, we compute the 24-agent subset associated with picking each of the agents in the scene to be the center agent. We choose the subset associated with the agent labeled as the self-driving car to be the first chosen subset. Among the agents not included in a subset yet, we find which agent has a 24-agent subset associated to it with the maximum number of agents already included in a chosen subset, and select that agent’s subset next. We continue until all agents are included in at least one of the subsets.
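A schematic version of this greedy subset selection is sketched below; subset_for is a stand-in for whatever nearest-neighbor routine produces the 24-agent subset centered on a given agent, and it is assumed to include the center agent in its own subset.

```python
def choose_agent_subsets(agent_ids, sdc_id, subset_for):
    """Greedy spatial sliding-window subset selection described above (schematic).

    subset_for(center_id) returns the 24-agent subset associated with choosing
    center_id as the center agent (including center_id itself).
    """
    subsets = [subset_for(sdc_id)]                    # the self-driving car's subset goes first
    covered = set(subsets[0])
    while not covered >= set(agent_ids):
        # Among uncovered agents, pick the one whose subset overlaps most with what is covered.
        best = max(
            (a for a in agent_ids if a not in covered),
            key=lambda a: len(set(subset_for(a)) & covered),
        )
        subsets.append(subset_for(best))
        covered |= set(subset_for(best))
    return subsets
```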

Importantly, to define the order for agents within the subset, we place any padding at the front, followed by all agents that will have already selected an action at the current timestep, followed by the remaining agents sorted by distance to the center agent. In keeping this order, we enable the agents to condition on the maximum amount of pre-generated information possible. Additionally, this ordering guarantees that the self-driving car is always the first to select an action at each timestep, in accordance with the guidelines for the WOMD sim agents challenge (Montali etal., 2023).

To sample an arbitrarily long scenario, we have the option to sample up to $t < T = 32$ steps before computing new 24-agent subsets. Computing new subsets every timestep ensures that the agents within a subset are always close to each other, but has the computational downside of requiring the transformer decoder key-value cache to be flushed at each timestep. For our submission, we compute the subsets at timesteps $t \in \{10, 34, 58\}$.

While the performance of our model under the WOMD sim agents metrics was largely unaffected by the choice of the hyperparameters above, we found that the metrics were sensitive to the temperature and nucleus that we use when sampling from the model. We use a temperature of 1.5 and a nucleus of 1.0 to achieve the results in Tab.1. Our intuition for why larger temperatures resulted in larger values for the sim agents metric is that the log likelihood penalizes lack of coverage much more strongly than lack of calibration, and higher temperature greatly improves the coverage.

Finally, we observed that, although the performance of our model sampling with temperature 1.5 was better than all prior work for interaction and map-based metrics as reported in Tab.3, the performance was worse than prior work along kinematics metrics. To test if this discrepancy was a byproduct of discretization, we trained a “heading smoother” by tokenizing trajectories, then training a small autoregressive transformer to predict back the original heading given the tokenized trajectory. On tokenized ground-truth trajectories, the heading smoother improves heading error from 0.58 degrees to 0.33 degrees. Note that the autoregressive design of the smoother ensures that it does not violate the closed-loop requirement for the Sim Agents Benchmark. The addition of this smoother did improve along kinematics metrics slightly, as reported in Tab.3. We reserve a more rigorous study of how to best improve the kinematic realism of trajectories sampled from discrete sequence models for future work.

Table 3: Effect of sampling hyperparameters and the heading smoother on the WOMD Sim Agents metrics.

Method | Realism meta metric ↑ | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑
τ=1.25, p_top=0.995 | 0.5176 | 0.3960 | 0.5520 | 0.6532
τ=1.5, p_top=1.0 | 0.5312 | 0.3963 | 0.5838 | 0.6607
τ=1.5, p_top=1.0, w/ h-smooth | 0.5352 | 0.4065 | 0.5841 | 0.6612

A.6 December 28, 2023 - Updated Sim Agents Metrics

On December 28, 2023, Waymo announced an adjustment to the metrics for the Sim Agents Benchmark to improve the accuracy of vehicle and off-road collision checking (more details can be found in Waymo’s announcement). Upon re-optimizing the hyperparameters of Trajeglish for the new metrics, we found that the optimal sampling hyperparameters were τ=1.0 and p_top=1.0, which is more intuitive than our previously chosen τ=1.5 given that the metrics are intended to measure the extent to which the distributions of sampled and recorded scenarios match. We then re-trained our model to condition on 32 agents at a time instead of 24, which also improved results slightly. For the final leaderboard results before the announcement of the 2024 Sim Agents Challenge, Trajeglish remained ahead of all models it had beaten under the previous metrics, although by much slimmer margins, as shown in Tab.4.

Table 4: Leaderboard results under the updated Sim Agents metrics.

Method | Realism meta metric ↑ | Kinematic metrics ↑ | Interactive metrics ↑ | Map-based metrics ↑
Trajeglish (τ=1.5) | 0.6078 | 0.4019 | 0.7274 | 0.7682
MTR_E | 0.6348 | 0.4180 | 0.7416 | 0.8400
MVTA | 0.6361 | 0.4175 | 0.7543 | 0.8253
Trajeglish (τ=1.0) | 0.6437 | 0.4157 | 0.7816 | 0.8213
MVTE | 0.6448 | 0.4202 | 0.7666 | 0.8387
Trajeglish (τ=1.0, AA=32) | 0.6451 | 0.4166 | 0.7845 | 0.8216

A.7 Additional Ablation Results

Full Control

In Fig.16, we find the sampled scenario with minimum corner distance to the ground-truth scenario and plot that distance as a function of the number of timesteps provided at initialization. When the initialization is a single timestep, the minADE of both models that take intra-timestep dependence into account improves. As more timesteps are provided, the effect diminishes, as expected. We visualize a small number of rollouts in the full autonomy setting in Fig.17; videos of additional rollouts can be found on our project page.

Partial Control

To quantitatively evaluate these rollouts, we measure the collision rate and visualize the results in Fig.A.8. Of course, we expect the collision rate to be high in these scenarios since most of the agents in the scene are on replay. For Trajeglish models, we find that when the autonomous agent is the first in the permutation to choose an action, it reproduces the performance of the model with no intra-timestep dependence. When the agent goes last, however, the collision rate drops significantly. Modeling intra-timestep interaction is a promising way to enable more realistic simulation with some agents on replay, which may have practical benefits given that the computational burden of simulating agents with replay is minimal. In Fig.18, we visualize how the trajectory of agents controlled by Trajeglish shifts between the full autonomy setting and the partial autonomy setting. The agent follows traffic flow and cedes the right of way when replay agents ignore the actions of the agent controlled by the traffic model.


A.8 Additional Analysis

Data and Training Statistics

We report a comparison between the number of tokens in WOMD and the number of tokens in the datasets used to train GPT-1 and GPT-2 in Tab.6. A text token and a motion token do not carry exactly the same information content, but we still think the comparison is worth making: it suggests that WOMD is similar in size to BookCorpus (Zhu et al., 2015), which was used to train GPT-1, and the scaling curves we compute for our model in Fig.4.2 support this comparison. We also report the number of tokens collected per hour of driving to estimate how many hours of driving would be necessary to reach a given token count. In Tab.7, we document the extent to which mixed precision and flash attention improve memory use and speed. Using these tools, our model takes 2 days to train on 4 A100s.
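As a rough worked example using the rates in Tab.6: at WOMD’s rate of roughly 1.2M tokens per hour of driving, collecting a GPT-2-scale corpus of about 9B tokens would require on the order of 9×10⁹ / 1.2×10⁶ ≈ 7,500 hours of recorded driving.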

Context Length

Context length refers to the number of tokens the model conditions on when predicting the distribution over the next token. Intuitively, as the model is given more context, it should get strictly better at predicting the next token. We quantify this effect in Fig.A.8. We find that the relative decrease in cross entropy from increasing the context length drops off steeply for pedestrians and cyclists, which aligns with the standard intuition that these kinds of agents are more Markovian. In comparison, we find a significant decrease in cross entropy with up to 2 seconds of context for vehicles, which is double the standard context length used for vehicles on motion prediction benchmarks (Ettinger et al., 2021; Caesar et al., 2019).
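This evaluation can be sketched as re-scoring held-out tokens while truncating the history visible to the model; here `model` is assumed to return per-position logits over the motion vocabulary, so the snippet is a sketch of the analysis rather than our exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nll_vs_context(model, tokens, context_lengths):
    """Average next-token NLL as a function of the number of past tokens kept."""
    results = {}
    for k in context_lengths:
        nlls = []
        for t in range(1, tokens.shape[1]):
            ctx = tokens[:, max(0, t - k):t]      # keep at most k past tokens
            logits = model(ctx)[:, -1]            # predicted distribution over token t
            nlls.append(F.cross_entropy(logits, tokens[:, t]))
        results[k] = torch.stack(nlls).mean().item()
    return results
```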

Figure: Partial control collision rate. We plot the collision rate as a function of rollout time when the traffic model controls only one agent while the rest are on replay. We expect this collision rate to be higher than the log collision rate since the replay agents do not react to the dynamic agents. We note that the collision rate decreases significantly just by placing the agent last in the order, showing that the model learns to condition on the actions of other agents within a single timestep effectively.

Figure: Context length. We plot the negative log-likelihood (NLL) when we vary the context length at test time, relative to the NLL at full context. Matching intuition, while pedestrians and cyclists are more Markovian on a short horizon, interaction occurs on a longer timescale for vehicles.

Table 6: Token counts and collection rates for driving datasets, compared to text corpora used to train GPT-1 and GPT-2.

Dataset | tokens | rate (tok/hour)
nuScenes | 3M | 0.85M
WOMD | 1.5B | 1.2M
WOMD (moving) | 1.1B | 0.88M
BookCorpus (GPT-1) | 1B | -
OpenWebText (GPT-2) | 9B | -

Table 7: Memory use and training speed with memory-efficient attention, mixed precision, and flash attention.

Model | memory | speed (steps/hour)
no intra | 14.7 MiB | 8.9k
Trajeglish (mem-efficient) | 7.2 MiB | 11.1k
Trajeglish (bfloat16+flash) | 5.6 MiB | 23.0k


