Extending Input Contexts of Language Models through Training on Segmented Sequences

Petros Karypis
UC San Diego
pkarypis@ucsd.edu

Julian McAuley
UC San Diego
jmcauley@ucsd.edu

George Karypis
University of Minnesota
karypis@umn.edu

Abstract

Effectively training language models on long inputs poses many technical challenges. As a cost consideration, language models are pretrained on a fixed sequence length before being adapted to longer sequences. We explore various methods for adapting models to longer inputs by training on segmented sequences, as well as an interpolation-based method for extending absolute positional embeddings. We develop a training procedure to extend the input context size of pretrained models with no architectural changes and no additional memory cost beyond training on the original input length. By sub-sampling segments from long inputs while maintaining their original positions, the model is able to learn new positional interactions. Our method benefits both models trained with absolute positional embeddings, by extending their input contexts, and popular relative positional embedding methods, which show reduced perplexity on sequences longer than those they were trained on. We demonstrate that our method can extend input contexts by a factor of 4x while improving perplexity.



1 Introduction

Transformer-based models Vaswani et al. (2017) capture sequence information through positional embeddings (PE). There are two types of PE: absolute and relative. Absolute positional embeddings (APE) learn a separate embedding for each position in a sequence; these embeddings are added to the input of the first layer. Relative positional embeddings (RPE) encode the relative distance between positions, often by down-weighting the attention scores of more distant positions.

The ability of models to process long sequences efficiently is of growing importance as models become more capable. An increased input context allows for more complex in-context learning examples Li et al. (2023a); Sun et al. (2023). Additionally, it allows for question answering and summarization over scientific papers and patents Dasigi et al. (2021); Koh et al. (2022); Sharma et al. (2019). Because an RPE's positional information is only a function of relative distance, these methods can be applied to any input sequence length. In practice, however, popular RPE methods fail to generalize to sequences longer than those they were trained on. Furthermore, self-attention's memory cost is quadratic, meaning training on long sequences becomes prohibitively expensive as the sequence length grows.

In this work, we study the problem of extending the input context of pre-trained decoder-only transformer-based models, considering those that use either absolute or relative positional embeddings. We show that an interpolation-based approach allows APE models to extrapolate to sequence lengths longer than those they were trained on, matching or outperforming the extrapolation ability of RPE methods like ALiBi Press et al. (2021) and RoPE Su et al. (2021). To further improve the ability of these models to take advantage of the longer input context, we present resource-efficient methods that continually pre-train APE- and RPE-based models on carefully sampled segmented sub-sequences of long sequences. Doing so simulates training on long sequences while remaining within a fixed input length. This allows the models to efficiently learn the embeddings of the newly created absolute positions or the relative embeddings associated with the longer pairwise distances.

[Figure 1: Visualization of the chunk and prefix subsequence sampling methods.]

We experiment with models trained with APEs, RoPE, and ALiBi to verify that our method improves extrapolation performance independent of the choice of positional embeddings. Results show that interpolating the embedding matrix of absolute positional embeddings, without any additional training, allows for extrapolation to sequences 5x the original input context. Furthermore, our segment-based methods increase the extrapolation ability of all positional embedding approaches. When applied to APEs, this method achieves 87% of the performance of training on sequences twice as long, with no extra memory footprint.

The paper is organized as follows: first, we review the existing literature that motivated our approach. Second, we formally define the problem of length extrapolation and propose our methods for efficiently extending a model's input context. Third, we provide a detailed breakdown of our experimental setup and methodology to enable reproducibility. Finally, we present our results along with a thorough discussion and analysis.

2 Related Work

2.1 Positional embeddings

Language is inherently sequential while Transformers are position-agnostic; to account for this, positional information is often introduced to the architecture. The original authors Vaswani et al. (2017) suggested adding a positional embedding to the input of the first layer and offered two methods: absolute positional embeddings and sinusoidal embeddings. Absolute positional embeddings consist of a learnable embedding matrix where each embedding corresponds to a position. While common, this method has an important limitation: it only allows for a fixed maximum input length determined during training. Sinusoidal embeddings did not have this limitation but performed worse in practice, and the relative embeddings that came after were difficult to parallelize Shaw et al. (2018), leading to APEs being the de facto method in early models, e.g., BERT Devlin et al. (2019) and GPT-3 Brown et al. (2020).

To address the limited input context size of APEs, researchers explored other relative positional embedding methods Chi et al. (2022); Wennberg and Henter (2021); Likhomanenko et al. (2021); Haviv et al. (2022). Most notable are rotary embeddings (RoPE) Su et al. (2021), the T5 bias Raffel et al. (2019), and ALiBi Press et al. (2021). RoPE rotates the query and key embeddings as a function of their position; this method allowed for easier parallelization compared to previous relative embeddings. The T5 bias Raffel et al. (2019) adds a positional embedding for each relative distance instead of each absolute position. ALiBi subtracts a linear bias from the query-key matrix product in the attention calculation. While the T5 bias extrapolates to long contexts well, it is too inefficient to scale, taking twice as long to train as sinusoidal embeddings Press et al. (2021). RoPE and ALiBi have been widely adopted in various LLMs, with LLaMA Touvron et al. (2023), GPT-J Wang and Komatsuzaki (2021), and PaLM Chowdhery et al. (2022) using RoPE and BLOOM Scao et al. (2022) using ALiBi.
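To make the ALiBi idea concrete, the following sketch (ours, not the implementation released with the cited work) builds an ALiBi-style additive bias for a single attention head; the per-head slope m follows a geometric schedule in the original paper and is treated here as a given hyperparameter.

```python
import torch

def alibi_bias(seq_len: int, m: float) -> torch.Tensor:
    """Illustrative ALiBi-style bias for one head: -m * (query-key distance).

    The returned matrix is added to q @ k.T / sqrt(d) before the softmax;
    future positions are masked for causal language modeling.
    """
    pos = torch.arange(seq_len)
    distance = pos[:, None] - pos[None, :]          # distance[i, j] = i - j
    bias = -m * distance.clamp(min=0).float()       # penalize distant (past) keys linearly
    bias = bias.masked_fill(distance < 0, float("-inf"))  # causal mask for future keys
    return bias
```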

2.2 Length generalization

The choice of positional embedding (PE) has been documented as one of the leading factors in a Transformer-based model's ability to generalize to variable sequence lengths. The authors of ALiBi Press et al. (2021) identified that RoPE and sinusoidal embeddings fail to generalize to sequence lengths greater than those they were trained on. Numerous new positional embedding methods with more favorable length-generalization properties have been proposed Sun et al. (2022); Chi et al. (2022); Li et al. (2023b), but these must be incorporated during pre-training.

There is a sizable body of work on methods for extending the input context of language models pre-trained with RoPE Chen et al. (2023); Jin et al. (2024); Peng et al. (2023); Ding et al. (2024). These approaches map the positional information of long sequences into ranges seen during training through positional interpolation. In practice, these methods require fine-tuning the models on long sequences to adjust to the new granularity of relative positional distance, which is computationally expensive.

2.3 Computationally efficient training

Numerous works have explored efficiency-oriented modifications to the standard Transformer architecture Xiong et al. (2021b); Choromanski et al. (2020); Kitaev et al. (2020); Qiu et al. (2019). These methods either modify the base architecture or rely on fast self-attention approximations.

While these methods all aim to reduce the memory cost of the Transformer architecture and allow for training on longer sequences, our work is orthogonal to these methods. Our approach can be used in conjunction with these existing methods since we do not rely on any specific architecture. We instead change the positional information of the input sequences.

2.4 Sparse input sequences

A number of works have explored training language models on sparse inputs. APEs have been shown to overfit to certain positions. To address this, Kiyono et al. (2021) proposed randomly padding or offsetting the positions during fine-tuning. This simple method led to better downstream performance on question answering and machine translation Tao et al. (2023) and general length extension Zhu et al. (2023); Ruoss et al. (2023). Another work proposed Forgetful Causal Masking (FCM) Liu et al. (2022), a simple modification to the next-token prediction task in which a randomly selected fraction of previous tokens is masked out. The authors demonstrated that this method led to improvements in both few-shot and fine-tuned performance compared to standard causal masking. Most similar to our work, RandomPos Ruoss et al. (2023) proposed sampling randomized, ordered positional embeddings to replace the sequential positional embeddings normally used. They sampled from a range of absolute positions much longer than the input sequence length. Results demonstrated that this led to an increase in extrapolation performance. The authors argued this was due to exposure to longer relative pairwise distances than those normally seen during training.

These results indicate that language models not only can be trained with heavily obfuscated sequences but can also benefit from doing so in some cases. This idea is the intuition behind our method.

3 Methods

There are three reasons that motivate this work. First, there exist numerous high-quality pre-trained models whose input context is limited to 1K–2K tokens; extending the input context of these models will further increase their applicability. Second, even though methods that rely on relative positional embeddings can operate on input contexts that are longer than what they were trained on, their out-of-the-box extrapolation performance is poor Press et al. (2021). Third, due to the quadratic complexity of self-attention and the linear compute/memory complexity of transformers with respect to sequence length, direct training on long input contexts is resource intensive. This limits the input context that we can directly train on.

3.1 Problem Statement

Let $p_\theta$ be a transformer-based language model trained to maximize the next-token probabilities over a set of sequences $\mathcal{D}$ of length $L_t$; i.e.,

$$\arg\max_\theta \sum_{\mathbf{x}\in\mathcal{D}} \sum_{i}^{L_t} \log p_\theta(x_i \mid x_{<i}). \qquad (1)$$

We will refer to $L_t$ as the model's training input context length.

We define extrapolation as the language model's ability to improve its next-token prediction by using input contexts that are longer than those it trained on. Specifically, for $k > L_t$, we will consider that a model can extrapolate successfully if

$$\sum_{i\geq k}\log p_\theta(x_i \mid x_{>k}) > \sum_{i\geq k}\log p_\theta(x_i \mid x_{>L_t}),$$

where $p_\theta(x_i \mid x_{>j}) = p_\theta(x_i \mid x_{i-1},\ldots,x_{i-j+1})$. In practice, we consider the average perplexity on sequences of different lengths from the same dataset a suitable proxy for this.

Given $p_\theta$ and $L_t$, the problem that we want to solve is to develop resource-efficient methods that allow $p_\theta$ to extrapolate to input contexts of length $L_e$ that are longer than $L_t$. We refer to $L_e$ as the extended input context length.

3.2 Extending APE via interpolation

APEs learn an embedding vector for each position up to a pre-specified maximum position. The fixed nature of the embedding matrix does not allow for inputs longer than the maximum pre-specified length. A necessary first step when training on longer sequences is to increase the size of the embedding matrix.

We use linear interpolation to extend the embedding matrix from the training input context length $L_t$ to the new input context length $L_e$ Dehghani et al. (2023). Let $E$ and $E'$ be the old and new embedding matrices, respectively, and assume that $\beta = L_e / L_t$ is integral. Then the embedding for position $i$ ($0 \leq i < L_e$) is given by:

$$e'_i = \frac{\beta - i\%\beta}{\beta}\, e_{\lfloor i/\beta\rfloor} + \frac{i\%\beta}{\beta}\, e_{\lfloor i/\beta\rfloor + 1},$$

where '%' is the modulo operation. This process retains the original embeddings but results in $\beta(L_t - 1) + 1$ embeddings. In practice, we set the remaining $\beta - 1$ embeddings to $e_{L_t}$.
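As a concrete illustration, the following sketch (function and tensor names are ours) interpolates a learned positional embedding matrix by an integral factor beta, mirroring the formula above and the choice of reusing the last embedding for the final beta - 1 positions.

```python
import torch

def interpolate_ape(E: torch.Tensor, beta: int) -> torch.Tensor:
    """Linearly interpolate an absolute positional embedding matrix.

    E: (L_t, d) learned embeddings; returns (beta * L_t, d) embeddings where
    position i mixes e_{floor(i/beta)} and e_{floor(i/beta)+1} with weights
    (beta - i % beta) / beta and (i % beta) / beta, respectively.
    """
    L_t, d = E.shape
    L_e = beta * L_t
    E_new = torch.empty(L_e, d, dtype=E.dtype)
    for i in range(L_e):
        lo = i // beta
        frac = (i % beta) / beta
        hi = min(lo + 1, L_t - 1)   # last beta - 1 positions reuse the final embedding
        E_new[i] = (1 - frac) * E[lo] + frac * E[hi]
    return E_new
```

When frac is zero the original embedding is copied unchanged, so the original positions keep their learned vectors.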

3.3 Efficient input context extension

Pairwise attention is the mechanism by which transformer models incorporate information from other tokens. Positional embeddings are how attention takes into account the absolute or relative positions of the token pairs. To fully take advantage of an increased input context, a model needs to learn the embeddings of the newly created absolute positions or the relative embeddings associated with the longer pairwise distances created by the increased input context. Thus, the model needs to be further pre-trained with input sequences that also include the new positions, in the case of absolute positional embeddings, or the longer pairwise distances, in the case of relative positional embeddings.

The key insight behind our efficient approaches is that we can meet the above requirements without directly training on long input sequences. Instead, we create short input sequences by sampling segments from the long sequences, keep the original positional information, concatenate them, and use them to further pre-train the language model. Since this approach retains the original positional information, the models see the new positions/distances and learn how to use them. Though the length of the short sequence is a hyper-parameter of our approach, in all of our experiments we keep it the same as that of the original input context length, i.e., $L_t$.

We develop two different subsequence sampling approaches that we refer to as chunk and prefix, which are defined as follows:

  • chunk-$\alpha$: This approach creates a short sequence by sampling a small number of equal-length contiguous subsequences from the long sequence. Specifically, given $0 < \alpha < 1$ and an $L_e$-long input sequence $\mathbf{x}$, this approach samples $1/\alpha$ contiguous non-overlapping subsequences of length $\alpha L_t$ from $\mathbf{x}$. The reason that we keep the sampled segments contiguous is to preserve the local context information, which is important for next-token prediction Xiong et al. (2021a), and we do not want our model to 'unlearn' it.

  • prefix-$\alpha$: This approach creates a short sequence by randomly sampling a set of tokens that forms a prefix and a contiguous segment that forms its associated suffix. Specifically, given $0 < \alpha < 1$ and an input sequence $\mathbf{x}$ of length $L_e$, it randomly selects an index $i$ with $(1-\alpha)L_t < i < L_e - \alpha L_t$. It creates the suffix by taking the $\alpha L_t$ contiguous tokens starting at position $i$ and creates the prefix by randomly sampling $(1-\alpha)L_t$ tokens from the positions preceding $i$. In this method we only compute the loss over the contiguous suffix in order to preserve the model's ability to incorporate local context.

A visualization of the different sampling methods can be found in Figure 1.
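The following sketch (our own rendering, with hypothetical tensor names) shows how a short training example of length L_t could be assembled from a long sequence of length L_e while keeping the original position ids; the loss mask for prefix reflects the choice to score only the contiguous suffix.

```python
import torch

def sample_chunk(x: torch.Tensor, L_t: int, alpha: float):
    """chunk-alpha: sample 1/alpha contiguous, non-overlapping segments of
    length alpha * L_t from x (length L_e), keeping the original positions."""
    L_e = x.size(0)
    seg_len = int(alpha * L_t)
    n_seg = int(1 / alpha)
    # pick n_seg distinct segment offsets on a seg_len grid, kept in order
    starts = (torch.randperm(L_e // seg_len)[:n_seg].sort().values * seg_len).tolist()
    idx = torch.cat([torch.arange(s, s + seg_len) for s in starts])
    return x[idx], idx                                        # tokens and their original position ids

def sample_prefix(x: torch.Tensor, L_t: int, alpha: float):
    """prefix-alpha: a randomly sampled prefix of (1 - alpha) * L_t tokens followed
    by a contiguous suffix of alpha * L_t tokens; loss is computed on the suffix only."""
    L_e = x.size(0)
    suf_len = int(alpha * L_t)
    pre_len = L_t - suf_len
    i = torch.randint(pre_len, L_e - suf_len, (1,)).item()    # suffix start index
    prefix_idx = torch.randperm(i)[:pre_len].sort().values    # sampled from positions before i, kept ordered
    suffix_idx = torch.arange(i, i + suf_len)
    idx = torch.cat([prefix_idx, suffix_idx])
    loss_mask = torch.zeros(L_t, dtype=torch.bool)
    loss_mask[pre_len:] = True                                # score only the contiguous suffix
    return x[idx], idx, loss_mask
```

In both cases the returned index tensor doubles as the position ids fed to the model, which is what exposes it to the new positions and longer pairwise distances.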

While these methods can introduce discontinuities in the causal language modeling objective, we argue that maintaining the original positional embeddings, together with the fact that discontinuities happen infrequently, limits the harm they may cause. In practice we use values of $\alpha$ small enough that discontinuities occur approximately 2% of the time for chunk and never for prefix.

4 Experimental setup

4.1 Dataset

Since we are comparing the performance of various methods on long sequences, we chose to use the scientific papers section of the arXiv dataset released by Cohan et al. (2018). Scientific papers are a common choice for reporting results on long-sequence modeling performance Beltagy et al. (2020). This dataset consists of 215K scientific papers, split into 205K train and 7K test, with a total token count of approximately 1.6 billion and an average document length of 4,938 tokens. We do not pack our batches Kosec et al. (2021), meaning each sequence contains text from only a single document at a time. If documents are longer than $L_e$, we split them into non-overlapping sequences of length $L_e$ and discard the remainder; if documents are shorter than $L_e$, we discard them as well. We feel that ensuring each input corresponds to only one source text is an important factor when reporting performance on long sequences.
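A minimal sketch of this preprocessing step (function and variable names are ours): each tokenized document is split into non-overlapping windows of length L_e, trailing remainders are dropped, and documents shorter than L_e yield no sequences at all.

```python
def build_sequences(token_docs, L_e):
    """Split each tokenized document into non-overlapping length-L_e sequences.

    token_docs: iterable of lists of token ids, one list per document.
    Documents shorter than L_e contribute nothing, and any trailing remainder
    is discarded, so every sequence comes from a single source document.
    """
    sequences = []
    for doc in token_docs:
        for start in range(0, len(doc) - L_e + 1, L_e):
            sequences.append(doc[start:start + L_e])
    return sequences
```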

4.2 Models

To evaluate our methods we fine-tune three different classes of pretrained language models, one for each of the popular positional embedding methods: absolute, RoPE, and ALiBi. We use models with approximately 1.5 billion parameters; for absolute positional embeddings we use GPT-2 Radford et al. (2019), for rotary embeddings we use Pythia Biderman et al. (2023), and for ALiBi we use Bloom Scao et al. (2022). In addition to these three models we use smaller GPT-2 and Pythia checkpoints (approximately 10% the size), which we refer to as GPT-2 Small and Pythia Small, and together as our development models. Due to a lack of small models trained with ALiBi, we do not have a development model for ALiBi. Key information about these models can be found in Table 1. Note that besides the positional encoding schemes, these models also differ in other ways, including training data and model parameters. As a result, a direct comparison of these models would be confounded by these additional factors. For this reason our evaluation only focuses on measuring how the different continual pre-training approaches improve each model's extrapolation capabilities relative to itself, and we never compare across models.

Table 1: Key information about the models used in our experiments.

Model          # of params   PE      L_t
GPT-2 Small    170M          APE     1024
Pythia Small   140M          RoPE    2048
GPT-2          1.64B         APE     1024
Pythia         1.4B          RoPE    2048
Bloom          1.45B         ALiBi   2048

4.3 Domain adaptation

The perplexity on arXiv for these models is relatively high, as arXiv is considered out of domain. In order to differentiate between gains attributed to adapting to the domain and gains from improved extrapolation performance, we perform one full epoch of continual pre-training with a sequence length of $L_t$ for each model.

We refer to the checkpoints after domain adaptation as "out-of-the-box" (OOTB) models. All experiments start from the OOTB models unless otherwise mentioned. The perplexity of the models after domain adaptation can be found in Table 2.

Table 2: Test-set perplexity after domain adaptation.

Model          ppl.
GPT-2 Small    9.311
Pythia Small   8.609
GPT-2          6.675
Pythia         6.677
Bloom          7.217

4.4 Segmented pre-training

For training we use the causal language modeling objective with a cross-entropy loss. All experiments on the same model are done in a compute-equivalent manner unless stated otherwise. To ensure compute equivalence when training our models, we fix the number of tokens as well as the input length, $L_t$, of the model.

Due to segmentation, one epoch of training on different sequence lengths results in a different number of tokens actually processed. For example, training with sequences of length $2L_t$ results in half the total number of tokens. To ensure an equal number of tokens across experiments we set the total number of epochs for each experiment to be:

$$\text{\# epochs} = \frac{L_e}{L_t}. \qquad (2)$$

4.5 Performance assessment

To evaluate the performance of our models on different sequence lengths, we report the mean perplexity on sequences of length $L_e$ from our test set. Perplexity measures the exponentiated average negative log-likelihood over a sequence of tokens and is a common evaluation metric for language models. We define the perplexity of a sequence of tokens $x$ of length $L_t$ as:

$$ppl(x) = \exp\Big(-\frac{1}{L_t}\sum_i \log p_\theta(x_i \mid x_{<i})\Big). \qquad (3)$$

Note that unlike previous work, we do not perform sliding-window evaluation Baevski and Auli (2018).
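A sketch of the perplexity computation in Equation 3 for a single full-length sequence, assuming a HuggingFace-style causal language model that returns logits; a single forward pass is used, with no sliding window.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sequence_perplexity(model, input_ids: torch.Tensor) -> float:
    """Perplexity of one sequence of shape (1, L): exp of the mean negative
    log-likelihood of each token given all preceding tokens."""
    logits = model(input_ids).logits                  # (1, L, vocab)
    shift_logits = logits[:, :-1, :]                  # predictions for tokens 1..L-1
    shift_labels = input_ids[:, 1:]                   # corresponding targets
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return torch.exp(nll).item()
```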

5 Results

We conduct our experiments and present results so as to answer the following questions:

  • How well do absolute positional embeddings extrapolate with interpolation of the embedding matrix?

  • Which of our proposed subsequence sampling methods performs the best and with what parameters?

  • How does our approach compare with continual pre-training on sequences of the original length?

5.1 Out-of-the-box extrapolation

We begin by examining each model's ability to extrapolate to sequences longer than those they were trained on without any further pre-training. We report the perplexity on the test set with sequence lengths starting from $L_t$ up to $5L_t$, depending on the memory constraints of each model. Previous length-extrapolation work did not include absolute positional embeddings due to their fixed nature Press et al. (2021). To increase the input context size we interpolate the positional embedding matrix as described in Section 3.2. Results are shown in Figure 2 and the corresponding numbers can be found in Table 6 in Appendix A.

[Figure 2: Out-of-the-box extrapolation perplexity for each model at sequence lengths from L_t to 5L_t.]

RoPE fails to extrapolate to sequences longer than those it was originally trained on, while ALiBi generalizes well. These findings about RPEs agree with those previously observed in Press et al. (2021). Our results show that interpolation works well until at least $5L_t$. This suggests that with linear interpolation APEs generalize better than RoPE and are comparable to ALiBi.

5.2 Comparison of segmented methods

We compare the performance of the various methods discussed in Section 3.3 on our development models. We train models on two separate extension sizes, $L_e = 2L_t$ and $L_e = 4L_t$. For each we use chunk with $\alpha = \{0.125, 0.25, 0.5\}$ and prefix with $\alpha = \{0.25, 0.5\}$. Furthermore, we train models on sequences of $2L_t$ and $4L_t$ without any segmentation. We refer to these models as full, and they provide a point of comparison between our methods and training on the full $L_e$ sequence. The complete set of results can be found in Table 3.

The different segment-based methods work well to extend the input context of these models. We observe a decrease in perplexity when evaluating on sequences longer than those the models were originally trained on. Overall, chunk performs better than prefix on both models; prefix fails to improve extrapolation when extending RoPE to sequences 4x in length. While the full approach has the lowest perplexity in most cases, the relative loss in performance for chunk is low. One notable case is extending RoPE to $4L_t$, where we observe chunk outperforming full. Given that chunk requires half the sequence length of full, it remains a competitive option due to its memory efficiency.

Comparing the performance of different chunk lengths, controlled by the parameter $\alpha$, both models display similar trends. For chunk, there appears to be a sweet spot between the number of segments and each segment's individual length (see Table 3). An $\alpha$ of 0.125 translates to chunks of 128 tokens for APE and 256 for RoPE. In most cases this $\alpha$ performed the worst amongst the chunk settings, as the segments may be too short or lead to too many discontinuities in the sequence. For prefix, there is less of a concrete pattern. This could be due to the higher level of randomness in prefix, as its tokens are sampled randomly. Between chunk and prefix, chunk computes the loss over twice as many tokens, which could be a contributing factor to the gap in performance between the two.

Between RoPE and APE, RoPE benefits the most from segmented pre-training. After training on segmented sequences, the perplexity on extensions of $2L_t$ and $4L_t$ decreases by a factor of 4x and 24x, respectively. While our method still improves over the "out-of-the-box" performance of APEs, interpolation is a competitive approach for length extension.

Table 3: Perplexity of the development models at extended lengths 2L_t and 4L_t.

PE     method        2L_t     4L_t
APE    OOTB          9.322    13.275
       full          8.287    7.819
       chunk-0.125   8.521    8.307
       chunk-0.25    8.471    7.989
       chunk-0.5     8.420    8.259
       prefix-0.25   8.757    8.826
       prefix-0.5    8.672    9.304
RoPE   OOTB          30.686   176.244
       full          7.403    7.353
       chunk-0.125   7.476    7.239
       chunk-0.25    7.447    7.210
       chunk-0.5     7.461    7.461
       prefix-0.25   9.543    25.539
       prefix-0.5    10.119   33.375

5.3 Results on larger models

Based on the findings in Section 5.2, we use chunk-0.25 for our experiments on GPT-2 1.5B, Pythia-1.4B, and Bloom-1.1B. As before, we continually pre-train the models as detailed in Section 4.4 and expand to $L_e = 2L_t$ and $L_e = 4L_t$.

Overall, chunk works for all three models on both expansion lengths. All models extrapolated better than their "out-of-the-box" performance. Again, RoPE was able to extrapolate to sequence lengths that it previously could not. Our method also demonstrated the ability to further increase the extrapolation ability of ALiBi. Results can be found in Table 4.

Table 4: Perplexity of the larger models at extended lengths 2L_t and 4L_t. DA denotes an additional epoch of domain adaptation (Section 5.4).

PE     method       2L_t     4L_t
APE    OOTB         6.326    7.099
       DA           6.125    7.050
       chunk-0.25   6.314    6.425
RoPE   OOTB         16.428   52.644
       DA           16.285   50.652
       chunk-0.25   5.448    5.278
ALiBi  OOTB         7.295    7.773
       DA           6.887    7.417
       chunk-0.25   6.773    7.295

5.4 Comparison with further pre-training

Given that ALiBi- and APE-based models already extrapolate well (see Figure 2), a natural question is whether the performance gains on longer sequences come from our segmented method or from additional domain adaptation. To ablate this, we perform another epoch of domain adaptation as described in Section 4.3. This isolates the benefit of our method versus further domain adaptation, as the total number of tokens seen by all models is the same. Results can be found in Table 4.

For models that extrapolate well (ALiBi and APE), further domain adaptation also improves the extrapolation ability, but the gains are smaller than those from our segmented training. The exception is when extending APE to lengths 2x; in this case domain adaptation performs slightly better. This result indicates that the interpolation-based extension method we propose works well for APEs. Overall, this demonstrates that while some of the gains may be due to further domain adaptation, our method is still beneficial for models that extrapolate well "out-of-the-box".

5.5 Comparison with RandomPos

The authors of RandomPos Ruoss et al. (2023) proposed a similar method for simulating training on long sequences within a fixed input context window. Instead of subsampling sequences of length $L_e$, RandomPos randomizes the positional ids of sequences of length $L_t$, selecting positions ranging over $[0, L_e - 1]$ while maintaining the causal ordering. Similar to our approach, RandomPos exposes the model to extrapolated pairwise relative distances, but the key difference is the content used. Whereas RandomPos only presents local context to the model, chunk exposes the model to distant content and encourages the model to learn to leverage distant contexts.
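For comparison, a sketch of the RandomPos-style position sampling we re-implemented (the function name is ours): position ids are drawn without replacement from the extended range and sorted so the causal ordering of the L_t local tokens is preserved.

```python
import torch

def randompos_position_ids(L_t: int, L_e: int) -> torch.Tensor:
    """Sample L_t ordered position ids from the extended range [0, L_e - 1].

    The token content remains an ordinary contiguous window of length L_t;
    only the positions are randomized, which exposes the model to long
    pairwise distances without showing it any distant content.
    """
    return torch.randperm(L_e)[:L_t].sort().values
```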

To verify that exposure to distant content is an important step in improving extrapolation, we implement a version of RandomPos and extend our models to 2x and 4x their original input sizes. We keep all settings and models the same as in Section 5.2, with the exception of including the ALiBi model. In all cases, chunk outperforms RandomPos, indicating that the inclusion of distant context is valuable for length extrapolation. Results can be found in Table 5.

Table 5: Comparison of chunk-0.25 with RandomPos at extended lengths 2L_t and 4L_t.

PE     method       2L_t     4L_t
APE    OOTB         9.322    13.275
       RandomPos    9.018    11.534
       chunk-0.25   8.420    7.989
RoPE   OOTB         30.686   176.244
       RandomPos    8.021    11.692
       chunk-0.25   7.447    7.210
ALiBi  OOTB         7.295    7.773
       DA           6.816    7.352
       chunk-0.25   6.773    7.295

6 Analysis

Our results demonstrate that segmented training is a viable approach to extend the input context size of language models. It is not immediately intuitive why, especially given that the relative positional embedding methods are not learned.

For absolute positional embeddings the reasoning is fairly straightforward. First, in Section 5.1 we demonstrated that interpolating the embedding matrix leads to reasonable extrapolation without any training, so before any training occurs the model already has some extrapolation ability. The segmented sequences then allow positions further apart than the input size normally allows to interact, and the model learns how to incorporate information across them.

In the case of relative positional embedding methods these results are less intuitive. Both RPE methods penalize the attention scores of positions as a function of their relative distance, meaning that initially there is not much attention across chunk boundaries. We hypothesize that through training on segmented sequences the model learns to attend to longer-range interactions. There is a lack of nearby positions for the model to attend to, so it learns to incorporate information from further away. In doing so it adjusts its weights to penalize distant positions less. This counteracts the RPE's inductive bias towards nearby positions.

To attempt to visualize this, we plot the distribution of median attention weights for positions past $L_t$. In both cases, the medians are well below the mean, suggesting that a few positions account for the majority of the attention weight. After segmented training, we observe that the average median increases and the medians become more evenly distributed. This suggests that more positions are being attended to, and that the model attends to more or fewer positions depending on the context. The plot can be found in Figure 3. This hypothesis is also supported by a recent work that analyzes the failure of RoPE to generalize to long sequences Xiong et al. (2023). They observed that simply reducing RoPE's decaying effect on distant tokens leads to strong extrapolation performance.
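A sketch of the analysis behind Figure 3, assuming access to per-head attention matrices (e.g. from a model called with output_attentions=True); the function name and pooling choices are ours. For each query position past L_t it takes the median attention weight over its keys, then the distribution of those medians can be compared before and after segmented training.

```python
import torch

def median_attention_past(attentions, L_t: int) -> torch.Tensor:
    """attentions: list of (batch, heads, L, L) attention maps, one per layer.

    Returns the median attention weight for every query position >= L_t,
    pooled over layers, heads, and batch elements.
    """
    meds = []
    for attn in attentions:                      # one tensor per layer
        past = attn[:, :, L_t:, :]               # queries beyond the original context
        meds.append(past.median(dim=-1).values)  # median over key positions
    return torch.cat([m.flatten() for m in meds])
```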

[Figure 3: Distribution of median attention weights for positions past L_t, before and after segmented training.]

7 Conclusion

In this work we proposed a simple and memory-efficient approach to extend the effective input context size of models through training on sequences created by sampling segments from long documents. We demonstrated that our method is robust to the choice of positional embeddings and allows models to be trained on sequences at least 4x their original input length. Furthermore, our results on extending absolute positional embeddings through interpolation demonstrated that they can extrapolate better than RoPE, and they provide a method to extend the context of models trained with APEs at no additional cost.

8 Limitations

In this work we explore various computationally efficient methods for pre-training on long sequences. Due to compute limitations we only verify our method's performance on models up to 1.4 billion parameters. Current state-of-the-art models are orders of magnitude larger. While our results indicate the success of our method, there is always the chance that results do not transfer to different model sizes. We believe these methods will hold as model size increases, since the extrapolation problem is fundamentally an artifact of the positional embeddings and not of model size. Additionally, the models we used were originally trained with a maximum sequence length of up to 2048 tokens and only extended to a maximum of 8192 tokens. Even though this is a 4x extension, it is much shorter than the input size of some production models.

In line with previous work on encoding positional information Press et al. (2021); Su et al. (2021), we use perplexity as our method for evaluating a model's extrapolation performance. Some recent work has shown that this may not always be a strong signal for downstream performance Shaham et al. (2022). A more thorough evaluation on downstream benchmarks would be insightful; unfortunately, the majority of our models were too weak to produce competitive performance on zero-shot or few-shot long-sequence tasks.

9 Ethics statement

When working with language models and large, web-crawled datasets it is important to remain cognizant of some of the potential ethical concerns. We trained on scientific papers which are voluntarily posted by users.

References

  • Baevski and Auli (2018)Alexei Baevski and Michael Auli. 2018.Adaptive input representations for neural language modeling.ArXiv, abs/1809.10853.
  • Beltagy etal. (2020)IzBeltagy, MatthewE. Peters, and Arman Cohan. 2020.Longformer: The long-document transformer.ArXiv, abs/2004.05150.
  • Biderman etal. (2023)StellaRose Biderman, Hailey Schoelkopf, QuentinG. Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, MohammadAflah Khan, Shivanshu Purohit, USVSNSai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar vander Wal. 2023.Pythia: A suite for analyzing large language models across training and scaling.ArXiv, abs/2304.01373.
  • Brown etal. (2020)TomB. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T.J. Henighan, Rewon Child, Aditya Ramesh, DanielM. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020.Language models are few-shot learners.ArXiv, abs/2005.14165.
  • Chen etal. (2023)Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023.Extending context window of large language models via positional interpolation.ArXiv, abs/2306.15595.
  • Chi etal. (2022)Ta-Chung Chi, Ting-Han Fan, PeterJ. Ramadge, and AlexanderI. Rudnicky. 2022.Kerple: Kernelized relative positional embedding for length extrapolation.ArXiv, abs/2205.09921.
  • Choromanski etal. (2020)Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, LucyJ. Colwell, and Adrian Weller. 2020.Rethinking attention with performers.ArXiv, abs/2009.14794.
  • Chowdhery etal. (2022)Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, YiTay, NoamM. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, BentonC. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, AndrewM. Dai, ThanumalayanSankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason Wei, KathleenS. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov,and Noah Fiedel. 2022.Palm: Scaling language modeling with pathways.ArXiv, abs/2204.02311.
  • Cohan etal. (2018)Arman Cohan, Franck Dernoncourt, DooSoon Kim, Trung Bui, Seokhwan Kim, W.Chang, and Nazli Goharian. 2018.A discourse-aware attention model for abstractive summarization of long documents.In North American Chapter of the Association for Computational Linguistics.
  • Dasigi etal. (2021)Pradeep Dasigi, Kyle Lo, IzBeltagy, Arman Cohan, NoahA. Smith, and Matt Gardner. 2021.A dataset of information-seeking questions and answers anchored in research papers.ArXiv, abs/2105.03011.
  • Dehghani etal. (2023)Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, IbrahimM. Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, GamaleldinF. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, AlexeyA. Gritsenko, Vighnesh Birodkar, CristinaNader Vasconcelos, YiTay, Thomas Mensink, Alexander Kolesnikov, Filip Paveti’c, Dustin Tran, Thomas Kipf, Mario Luvci’c, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, and Neil Houlsby. 2023.Scaling vision transformers to 22 billion parameters.ArXiv, abs/2302.05442.
  • Devlin etal. (2019)Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.Bert: Pre-training of deep bidirectional transformers for language understanding.ArXiv, abs/1810.04805.
  • Ding etal. (2024)Yiran Ding, LiLyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024.Longrope: Extending llm context window beyond 2 million tokens.ArXiv, abs/2402.13753.
  • Haviv etal. (2022)Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. 2022.Transformer language models without positional encodings still learn positional information.In Conference on Empirical Methods in Natural Language Processing.
  • Jin etal. (2024)Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia yuan Chang, Huiyuan Chen, and Xia Hu. 2024.Llm maybe longlm: Self-extend llm context window without tuning.ArXiv, abs/2401.01325.
  • Kitaev etal. (2020)Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020.Reformer: The efficient transformer.ArXiv, abs/2001.04451.
  • Kiyono etal. (2021)Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. 2021.Shape: Shifted absolute position embedding for transformers.In Conference on Empirical Methods in Natural Language Processing.
  • Koh etal. (2022)HuanYee Koh, Jiaxin Ju, Ming Liu, and Shirui Pan. 2022.An empirical survey on long document summarization: Datasets, models, and metrics.ACM Computing Surveys, 55:1 – 35.
  • Kosec etal. (2021)Matej Kosec, Shengyu Fu, and MarioMichael Krell. 2021.Packing: Towards 2x nlp bert acceleration.ArXiv, abs/2107.02027.
  • Li etal. (2023a)Mukai Li, Shansan Gong, Jiangtao Feng, Yiheng Xu, Jinchao Zhang, Zhiyong Wu, and Lingpeng Kong. 2023a.In-context learning with many demonstration examples.ArXiv, abs/2302.04931.
  • Li etal. (2023b)Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, SumitK. Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. 2023b.Functional interpolation for relative positions improves long context transformers.ArXiv, abs/2310.04418.
  • Likhomanenko etal. (2021)Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, and Alexey Rogozhnikov. 2021.Cape: Encoding relative positions with continuous augmented positional embeddings.In Neural Information Processing Systems.
  • Liu etal. (2022)Hao Liu, Xinyang Geng, Lisa Lee, Igor Mordatch, Sergey Levine, Sharan Narang, and P.Abbeel. 2022.Fcm: Forgetful causal masking makes causal language models better zero-shot learners.ArXiv, abs/2210.13432.
  • Peng etal. (2023)Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023.Yarn: Efficient context window extension of large language models.ArXiv, abs/2309.00071.
  • Press etal. (2021)Ofir Press, NoahA. Smith, and Mike Lewis. 2021.Train short, test long: Attention with linear biases enables input length extrapolation.ArXiv, abs/2108.12409.
  • Qiu etal. (2019)Jiezhong Qiu, Hao Ma, Omer Levy, Scott Yih, Sinong Wang, and Jie Tang. 2019.Blockwise self-attention for long document understanding.ArXiv, abs/1911.02972.
  • Radford etal. (2019)Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.Language models are unsupervised multitask learners.
  • Raffel etal. (2019)Colin Raffel, NoamM. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and PeterJ. Liu. 2019.Exploring the limits of transfer learning with a unified text-to-text transformer.ArXiv, abs/1910.10683.
  • Ruoss etal. (2023)Anian Ruoss, Gr’egoire Del’etang, Tim Genewein, Jordi Grau-Moya, R.Csordás, MehdiAbbana Bennani, Shane Legg, and Joel Veness. 2023.Randomized positional encodings boost length generalization of transformers.ArXiv, abs/2305.16843.
  • Scao etal. (2022)TevenLe Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick, Suzana Ili’c, Daniel Hesslow, Roman Castagn’e, AlexandraSasha Luccioni, Franccois Yvon, Matthias Gallé, Jonathan Tow, AlexanderM. Rush, StellaRose Biderman, Albert Webson, PawanSasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, AlbertVillanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, IzBeltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, PedroOrtiz Suarez, Victor Sanh, Hugo Laurenccon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, AitorSoroa Etxabe, AlhamFikri Aji, Amit Alfassy, Anna Rogers, ArielKreisberg Nitzav, Canwen Xu, Chenghao Mou, ChrisC. Emezue, Christopher Klamm, Colin Leong, DanielAlexander van Strien, DavidIfeoluwa Adelani, DragomirR. Radev, EduardoGonz’alez Ponferrada, Efrat Levkovizh, Ethan Kim, EyalBar Natan, FrancescoDe Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady ElSahar, HamzaBenyamina, HieuTrung Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier dela Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jorg Frohberg, JosephineL. Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro von Werra, Leon Weber, Long Phan, LoubnaBen Allal, Ludovic Tanguy, Manan Dey, ManuelRomero Muñoz, Maraim Masoud, Mar’ia Grandury, Mario vSavsko, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, MinhChien Vu, MohammadAli Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona deGibert, Paulo Villegas, Peter Henderson, Pierre Colombo, PriscillaA. Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto L’opez, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, ShamsuddeenHassan Muhammad, Shanya Sharma, S.Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, TiagoTimponi Torrent, Timo Schick, Tristan Thrush, ValentinDanchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Elizabeth Salesky, SabrinaJ. Mielke, WilsonY. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harsh*t Pandey, Hendrik Strobelt, JasonAlan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, MSaiful Bari, MagedS. Al-shaibani, Matteo Manica, NihalV. Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, StephenH. Bach, Taewoon Kim, Tali Bers, Thibault Févry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiang Tang, ZhengXin Yong, Zhiqing Sun, Shaked Brody, YUri, Hadar Tojarieh, Adam Roberts, HyungWon Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrickvon Platen, Pierre Cornette, PierreFranccois Lavall’ee, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aur’elie N’ev’eol, Charles Lovering, DanielH Garrette, DeepakR. 
Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, GentaIndra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, JessicaZosa Forde, Xiangru Tang, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar vander Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, S.Osher Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdenvek Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, AnandaSantaRosa Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, BenjaminOlusola Ajibade, BharatKumar Saxena, CarlosMuñoz Ferrandis, Danish Contractor, DavidM. Lansky, Davis David, Douwe Kiela, DuongAnh Nguyen, Edward Tan, Emily Baylor, Ezinwanne Ozoani, FatimT Mirza, Frankline Ononiwu, Habib Rezanejad, H.A. Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jan Passmore, Joshua Seltzer, JulioBonis Sanz, Karen Fort, LíviaMacedo Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, M.K.K. Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nourhan Fahmy, Olanrewaju Samuel, Ran An, R.P. Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, SilasL. Wang, Sourav Roy, Sylvain Viguier, Thanh-Cong Le, Tobi Oyebade, Trieu NguyenHai Le, Yoyo Yang, ZacharyKyle Nguyen, AbhinavRamesh Kashyap, A.Palasciano, AlisonCallahan, Anima Shukla, Antonio Miranda-Escalada, AyushKumar Singh, Benjamin Beilharz, BoWang, Caio MatheusFonseca deBrito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, DanielLe’on Perin’an, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, HelenaU. Vrabec, ImanI.B. Bello, Isha Dash, JiSoo Kang, John Giorgi, Jonas Golde, JoseDavid Posada, Karthi Sivaraman, Lokesh Bulchandani, LuLiu, Luisa Shinzato, MadeleineHahn deBykhovetz, Maiko Takeuchi, Marc Pàmies, MaríaAndrea Castillo, Marianna Nezhurina, Mario Sanger, Matthias Samwald, Michael Cullan, Michael Weinberg, MWolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, NicholasMichio Broad, Nikolaus Muellner, Pascale Fung, Patricia Haller, R.Chandrasekhar, R.Eisenberg, Robert Martin, RodrigoL. Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, ShlokS Deshmukh, Shubhanshu Mishra, SidKiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, SushilPratap Bharati, T.A. Laud, Th’eo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yashasvi Bajaj, Y.Venkatraman, Yifan Xu, Ying Xu, Yun chao Xu, ZheeXao Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. 2022.Bloom: A 176b-parameter open-access multilingual language model.ArXiv, abs/2211.05100.
  • Shaham etal. (2022)Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. 2022.SCROLLS: Standardized CompaRison over long language sequences.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 12007–12021, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Sharma etal. (2019)Eva Sharma, Chen Li, and LuWang. 2019.Bigpatent: A large-scale dataset for abstractive and coherent summarization.ArXiv, abs/1906.03741.
  • Shaw etal. (2018)Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018.Self-attention with relative position representations.In North American Chapter of the Association for Computational Linguistics.
  • Su etal. (2021)Jianlin Su, YuLu, Shengfeng Pan, BoWen, and Yunfeng Liu. 2021.Roformer: Enhanced transformer with rotary position embedding.ArXiv, abs/2104.09864.
  • Sun etal. (2023)Simeng Sun, Y.Liu, Shuo Wang, Chenguang Zhu, and Mohit Iyyer. 2023.Pearl: Prompting large language models to plan and execute actions over long documents.ArXiv, abs/2305.14564.
  • Sun etal. (2022)Yutao Sun, LiDong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2022.A length-extrapolatable transformer.ArXiv, abs/2212.10554.
  • Tao etal. (2023)Mingxu Tao, Yansong Feng, and Dongyan Zhao. 2023.A frustratingly easy improvement for position embeddings via random padding.ArXiv, abs/2305.04859.
  • Touvron etal. (2023)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aur’elien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023.Llama: Open and efficient foundation language models.ArXiv, abs/2302.13971.
  • Vaswani etal. (2017)Ashish Vaswani, NoamM. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.Attention is all you need.In NIPS.
  • Wang and Komatsuzaki (2021)Ben Wang and Aran Komatsuzaki. 2021.GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.https://github.com/kingoflolz/mesh-transformer-jax.
  • Wennberg and Henter (2021)Ulme Wennberg and GustavEje Henter. 2021.The case for translation-invariant self-attention in transformer-based language models.ArXiv, abs/2106.01950.
  • Xiong etal. (2023)Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, KarthikAbinav Sankararaman, Barlas Oğuz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Ksh*tiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. 2023.Effective long-context scaling of foundation models.ArXiv, abs/2309.16039.
  • Xiong etal. (2021a)Wenhan Xiong, Barlas Ouguz, Anchit Gupta, Xilun Chen, Diana Liskovich, Omer Levy, Wen tau Yih, and Yashar Mehdad. 2021a.Simple local attentions remain competitive for long-context tasks.In North American Chapter of the Association for Computational Linguistics.
  • Xiong etal. (2021b)Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, GlennMoo Fung, Yin Li, and Vikas Singh. 2021b.Nyströmformer: A nyström-based algorithm for approximating self-attention.Proceedings of the … AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, 35 16:14138–14148.
  • Zhu etal. (2023)Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2023.Pose: Efficient context window extension of llms via positional skip-wise training.

Appendix A Full Results

Table 6: Out-of-the-box perplexity at multiples of the training input context length L_t.

PE      1x      2x       3x       4x       5x
APE     6.675   6.326    6.394    7.099    8.438
RoPE    6.677   17.348   45.797   69.288   -
ALiBi   7.217   7.295    7.653    7.773    -