SmoothI: Smooth Rank Indicators for Differentiable IR Metrics

Thibaut Thonet*1, Yagmur Gizem Cinar†2, Eric Gaussier3, Minghan Li4, and Jean-Michel Renders5

1 Univ. Grenoble Alpes, CNRS, Grenoble INP - thibaut.thonet@naverlabs.com
2 Univ. Grenoble Alpes, CNRS, Grenoble INP - yg.cinar@gmail.com
3 Univ. Grenoble Alpes, CNRS, Grenoble INP - eric.gaussier@univ-grenoble-alpes.fr
4 Univ. Grenoble Alpes, CNRS, Grenoble INP - minghan.li@univ-grenoble-alpes.fr
5 NaverLabs Europe - jean-michel.renders@naverlabs.com

* Now at NaverLabs Europe. † Now at Amazon Development Centre Scotland.

arXiv:2105.00942v1 [cs.IR] 3 May 2021

Abstract

Information retrieval (IR) systems traditionally aim to maximize metrics built on rankings, such as precision or NDCG. However, the non-differentiability of the ranking operation prevents direct optimization of such metrics in state-of-the-art neural IR models, which rely entirely on the ability to compute meaningful gradients. To address this shortcoming, we propose SmoothI, a smooth approximation of rank indicators that serves as a basic building block to devise differentiable approximations of IR metrics. We further provide theoretical guarantees on SmoothI and derived approximations, showing in particular that the approximation errors decrease exponentially with an inverse temperature-like hyperparameter that controls the quality of the approximations. Extensive experiments conducted on four standard learning-to-rank datasets validate the efficacy of the listwise losses based on SmoothI, in comparison to previously proposed ones. Additional experiments with a vanilla BERT ranking model on a text-based IR task also confirm the benefits of our listwise approach.

1 Introduction

Learning to rank [27] is a sub-field of machine learning and information retrieval (IR) that aims at learning, from some training data, functions able to rank a set of objects – typically a set of documents for a given query. Learning to rank is currently one of the privileged approaches to build IR systems. This said, one important problem faced with learning to rank is that the metrics considered to evaluate the quality of a system, and the losses they underlie, are usually not differentiable. This is typically the case in IR: popular IR metrics such as precision at K, mean average precision or normalized discounted cumulative gain are neither continuous nor differentiable. As such, state-of-the-art optimization techniques, such as stochastic gradient descent, cannot be used to learn systems that optimize their values. To address this problem, researchers have followed two main paths. The first one consists in replacing the loss associated with a given metric by a surrogate loss which is easier to optimize. Such an approach is studied in [3, 9, 20, 39, 40, 46, 52], for example. A surrogate loss typically upper bounds the true loss and, if consistent, asymptotically (usually when the number of samples tends to infinity) behaves like it. The second solution is to identify differentiable approximations of the metrics considered. This approach was adopted in, e.g., [38, 41, 45, 50]. Typically, such approximations converge towards the true metrics when a hyperparameter that controls the quality of the approximation tends to a given value. Both approaches define optimization problems that approximate the original problem, and both have advantages and disadvantages. One of the main advantages of surrogate losses lies in the fact that it is sometimes possible to rely on an optimization problem that is convex and thus relatively simple to solve.
However, Calauzènes et al. [8, 7] have shown that convex and consistent surrogate ranking losses do not always exist, as for example for the mean average precision or the expected reciprocal rank. Furthermore, as pointed out in [4], "surrogate losses in learning to rank are often loosely related to the target loss or upper-bound a ranking utility function instead". On the other hand, one of the main advantages in using a differentiable approximation of a metric is the fact that one directly approximates the true loss, the quality of the approximation being controlled by a hyperparameter and not by the number of samples considered. Although the optimization problem obtained is in general non-convex and its solution usually corresponds to a local optimum, the recent success of deep learning shows that solving non-convex optimization problems can nonetheless lead to state-of-the-art systems. We follow here this latter path and study differentiable approximations of standard IR metrics. To do so, we focus on one ingredient at the core of these metrics (as well as other ranking metrics), namely the rank indicator function. We show how one can define high-quality, differentiable approximations of the rank indicator and how these lead to good approximations of the losses associated with standard IR metrics. Our contributions are thus three-fold:

• We introduce SmoothI, a novel differentiable approximation of the rank indicator function that can be used in a variety of ranking metrics and losses.

• We furthermore show that this approximation, as well as the differentiable IR metrics and losses derived from it, converge towards their true counterparts with theoretical guarantees. As such, our proposal complements existing ones and extends the set of tools available for differentiable approaches to ranking.

• Lastly, we empirically illustrate the behavior of our proposal on both learning-to-rank features and standard, text-based features. To foster reproducibility, we publicly release our source code (https://github.com/ygcinar/SmoothI).

The remainder of the paper is organized as follows. Section 2 provides the background of our study. Section 3 introduces the differentiable approximation of the rank indicator we propose and Section 4 describes how to use it with standard IR metrics. We study the empirical behavior of this proposal in Section 5 and discuss the related work in Section 6. Finally, Section 7 concludes the paper.

2 Preliminaries

For a given query, an IR system returns a list of ranked documents. The ranking is based on scores provided by the IR system – scores that we assume here to be strictly positive and distinct (this is not a restriction per se, as one can add an arbitrarily large value to the scores without changing their ranking, and ties can be broken randomly) – and that will be denoted by $S = \{S_1, \ldots, S_N\}$ for a list of N documents. To assess the validity of an IR system, one uses gold standard collections in which the true relevance scores of documents are known, and IR metrics that assess to which extent the IR system is able to place documents with higher relevance scores at the top of the ranked list it returns. The most popular metrics are certainly the precision at K (denoted by P@K), which measures the precision in the list of top-K documents, its extension Mean Average Precision (MAP) – which has a strong dependence on recall [16] and thus tends to be less used in IR evaluation – as well as the Normalized Discounted Cumulative Gain at K (NDCG@K), which can take into account graded relevance judgements.
P@K is the average over queries of P@K_q, defined for a given query q by:

$$\mathrm{P@}K_q = \frac{1}{K} \sum_{r=1}^{K} \mathrm{rel}_q(j_r), \qquad (1)$$

where $j_r$ is the r-th highest document in the list of scores S (i.e., the document with the r-th largest score in S) and $\mathrm{rel}_q(j)$ is a binary relevance score that is 1 if document j is relevant to q and 0 otherwise. MAP is the average over queries of AP_q, defined by:

$$\mathrm{AP}_q = \frac{1}{\sum_{j=1}^{N} \mathrm{rel}_q(j)} \sum_{K=1}^{N} \mathrm{rel}_q(j_K)\, \mathrm{P@}K_q. \qquad (2)$$

The normalized discounted cumulative gain at rank K, NDCG@K, is the average over queries of NDCG@K_q, defined for a given query q by:

$$\mathrm{NDCG@}K_q = \frac{1}{N_K^q} \sum_{r=1}^{K} \frac{2^{\mathrm{rel}_q(j_r)} - 1}{\log_2(r+1)}, \qquad (3)$$

where $\mathrm{rel}_q(j)$ is now a (not necessarily binary) positive, bounded relevance score for document j with respect to query q (higher values correspond to higher relevance) and $N_K^q$ is a query-dependent normalizing constant. The standard NDCG metric corresponds to NDCG@N [19].

We are interested here in differentiable approximations of these metrics so as to rely on state-of-the-art machine learning methods to develop IR systems optimized for P@K, NDCG@K and their extensions. The approximations we will consider are parameterized by an inverse temperature-like hyperparameter (a standard approach for continuous approximations; see, e.g., [43]), called α, that controls the quality of the approximation – the approximation being more accurate when α tends to ∞ and smoother when α approaches 0. The following definition formalizes the notion of differentiable approximations in our context.

Definition 1. Consider a function $f : S \mapsto f(S)$ where $S = (S_1, \ldots, S_N) \in \mathbb{R}^N$. A function $f^\alpha$ defined on $\mathbb{R}^N$ is said to be a differentiable approximation of the function f iff $f^\alpha$ is differentiable wrt any $S_i$, $i \in \{1, \ldots, N\}$, and $\lim_{\alpha \to \infty} f^\alpha(S) = f(S)$ for all S.

As the reader may have noticed, the common building block and main ingredient of the above IR metrics (Eqs. 1, 2, 3) is the relevance score of the document at any rank r, namely $\mathrm{rel}_q(j_r)$. If one can define a "good" differentiable approximation of $\mathrm{rel}_q(j_r)$, then one disposes of a "good" differentiable approximation of IR metrics. The goal of this paper is to introduce such differentiable approximations, while giving "good" a precise meaning. It is easy to see that, in a list of N documents, for 1 ≤ r ≤ N, one has:

$$\mathrm{rel}_q(j_r) = \sum_{j=1}^{N} \mathrm{rel}_q(j)\, I_j^r, \qquad (4)$$

where $I_j^r$ is the indicator function at rank r defined by:

$$I_j^r = \begin{cases} 1 & \text{if } j \text{ is the } r\text{-th highest document in the list,} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, one can obtain differentiable approximations of IR metrics from a differentiable approximation of the rank indicators. We propose such an approximation in the following section.

3 Smooth Rank Indicators

Before generalizing our approximation to any rank r, let us first review the top-1 case, i.e., where one only seeks the document with the largest score in $S = \{S_1, \ldots, S_N\}$. In this case, the true rank indicator can be expressed using the argmax operator:

$$I_j^1 = \begin{cases} 1 & \text{if } j = \operatorname{argmax}_{j' \in \{1,\ldots,N\}} S_{j'}, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

A widespread smooth approximation of the argmax is the parameterized softmax.
This latter has been employed in, e.g., [32] in the context of deep k-means clustering, in [35] in the context of neural nearest neighbor networks, as well as in [18, 30] within a Gumbel-softmax distribution employed to approximate categorical samples. We then easily obtain the smooth rank indicator for rank 1 as follows:

$$I_j^{1,\alpha} = \frac{e^{\alpha S_j}}{\sum_{j'} e^{\alpha S_{j'}}},$$

which behaves, when α → +∞, as the true indicator function $I_j^1$.

[Figure 1: Illustration of SmoothI and its positioning in a neural retrieval system. Given a query q, the document representations $\{X_{q,d_i}\}_{i=1}^N$ are first passed through a neural model which outputs a set of scores $\{S_{d_i}\}_{i=1}^N$. The scores are then processed by the SmoothI module, yielding smooth rank indicators $\{I^{r,\alpha}\}_{r=1}^K$ up to rank K, which are ultimately used to calculate the ranking loss.]

We now wish to generalize the true rank indicator formulation given in Eq. 5 to any rank r ≥ 1. This can be achieved by introducing a recursive dependency between $I_j^r$ and $\{I_j^l\}_{l=1}^{r-1}$:

$$I_j^r = \begin{cases} 1 & \text{if } j = \operatorname{argmax}_{j' \in \{1,\ldots,N\},\ \forall l < r,\ I_{j'}^l = 0} S_{j'}, \\ 0 & \text{otherwise.} \end{cases}$$

The constraint $\forall l < r,\ I_{j'}^l = 0$ ensures that the (r − 1) highest documents are ignored and not repeatedly selected by the argmax. Given the non-negativity assumption on the scores, the previous formulation can be equivalently expressed by integrating the constraint $\forall l < r,\ I_{j'}^l = 0$ in the objective as follows:

$$I_j^r = \begin{cases} 1 & \text{if } j = \operatorname{argmax}_{j' \in \{1,\ldots,N\}} S_{j'} \prod_{l=1}^{r-1} (1 - I_{j'}^l), \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

3.1 SmoothI: Generalization to Rank r

Building up on Eq. 6, we propose SmoothI (pronounced "smoothie"), a formulation for smooth rank indicators $I_j^{r,\alpha}$ which generalizes the parameterized softmax function by recursively eliminating the (r − 1) highest documents in the set. Formally, for any rank r ∈ {1, ..., N} and document j ∈ {1, ..., N}, we define $I_j^{r,\alpha}$ as:

$$I_j^{r,\alpha} = \frac{e^{\alpha S_j \prod_{l=1}^{r-1} (1 - I_j^{l,\alpha} - \delta)}}{\sum_{j'} e^{\alpha S_{j'} \prod_{l=1}^{r-1} (1 - I_{j'}^{l,\alpha} - \delta)}}, \qquad (7)$$

where δ ∈ (0, 0.5) is an additional hyperparameter which intuitively controls the mass of the distribution that is allocated to the (r − 1) highest documents. A larger δ leads to further reducing the contribution of the (r − 1) highest documents in the distribution at rank r. The integration of SmoothI in a neural retrieval system is depicted in Fig. 1.
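To make the recursion concrete, below is a minimal PyTorch sketch of the smooth rank indicators of Eq. 7. The function name `smooth_rank_indicators` and the toy scores are ours (illustrative, not the interface of the released code); scores are assumed strictly positive and distinct as in Section 2, and the `stop_grad` flag anticipates the stabilized variant of Eq. 9 introduced in Section 3.3.

```python
import torch

def smooth_rank_indicators(scores, K, alpha=1.0, delta=0.1, stop_grad=True):
    """Sketch of Eq. 7 (and Eq. 9 when stop_grad=True): smooth rank indicators
    I^{r,alpha} for ranks r = 1..K, given strictly positive scores of shape (N,).
    Returns a (K, N) tensor whose r-th row approximates the one-hot indicator
    of the r-th highest document."""
    indicators = []
    # Running product prod_{l<r} (1 - I^{l,alpha} - delta), one value per document
    mask = torch.ones_like(scores)
    for r in range(K):
        m = mask.detach() if stop_grad else mask  # stop-gradient of Eq. 9
        logits = alpha * scores * m
        I_r = torch.softmax(logits, dim=0)
        indicators.append(I_r)
        mask = mask * (1.0 - I_r - delta)
    return torch.stack(indicators, dim=0)

# Toy usage: with a large alpha each row approaches a one-hot indicator
scores = torch.tensor([0.3, 2.0, 1.1, 0.7], requires_grad=True)
I = smooth_rank_indicators(scores, K=3, alpha=50.0, delta=0.1)
print(I)           # rows peak at documents 1, 2, 3 (0-indexed)
I.sum().backward()  # gradients flow back to the scores
```

With a large α, each row of the returned matrix concentrates on the document holding the corresponding rank, while a small α yields smoother distributions over the documents.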
The following theorem states that $I_j^{r,\alpha}$ plays the role of a smooth, differentiable approximation of the true rank indicator $I_j^r$:

Theorem 1. For any r ∈ {1, ..., N} and j ∈ {1, ..., N}: (i) $I_j^{r,\alpha}$ is differentiable wrt any score in S, (ii) $\lim_{\alpha \to +\infty} I_j^{r,\alpha} = I_j^r$.

Proof. (Sketch) Statement (i) is straightforward to establish by observing that $I_j^{r,\alpha}$ is a composition of differentiable functions. The proof of (ii) proceeds by induction over r. Let us assume that the property is true up to rank r − 1 with r > 1 (the case r = 1 is straightforward) and let us prove it is true for rank r. Let us denote, for any j', $S_{j'} \prod_{l=1}^{r-1} (1 - I_{j'}^{l,\alpha} - \delta)$ by $A_{j'}^{\alpha,r}$. Let $j_r$ be the r-th highest document and let us consider a document j that is not in the set of top-(r − 1) documents, denoted as $B^r = \{j_l, 1 \le l < r\}$, and different from $j_r$. Then, for any arbitrarily small number ε > 0, exploiting the convergence of $I_j^{l,\alpha}$ and $I_{j_r}^{l,\alpha}$ to 0 for l < r, one can show that there exist $A_\epsilon > 0$ and η > 0 s.t. for any $\alpha > A_\epsilon$ one has:

$$A_{j_r}^{\alpha,r} - A_j^{\alpha,r} > S_{j_r} \left( (1 - \epsilon - \delta)^{r-1} - \frac{S_j}{S_{j_r}} (1 - \delta)^{r-1} \right).$$

The function $f(\epsilon) = (1 - \epsilon - \delta)^{r-1} - \frac{S_j}{S_{j_r}} (1 - \delta)^{r-1}$ is continuous and verifies f(0) > 0. Thus, there exist η > 0 and $\epsilon_0 > 0$ s.t. for any $\epsilon < \epsilon_0$ there exists $A_\epsilon > 0$ s.t. for any $\alpha > A_\epsilon$ one has: $A_{j_r}^{\alpha,r} - A_j^{\alpha,r} > \eta$. Following a similar reasoning, we show that if $j' \in B^r$, there exist $\epsilon'_0$ and $A'_\epsilon$ s.t. for any $\epsilon < \epsilon'_0$ and any $\alpha > A'_\epsilon$:

$$A_{j_r}^{\alpha,r} - A_{j'}^{\alpha,r} > S_{j_r} (1 - \epsilon - \delta)^{r-2} \left( 1 - \epsilon - \delta - \frac{S_{j'}}{S_{j_r}} (\epsilon - \delta) \right) > \eta. \qquad (8)$$

Factorizing $A_{j_r}^{\alpha,r}$ in the expression of $I_j^{r,\alpha}$ leads to the desired result.

Th. 1 establishes that the smooth rank indicators we have introduced converge towards their true counterparts. We now turn to assessing the speed of this convergence.

3.2 Quality of the Approximation

The following theorem states that $I_j^{r,\alpha}$ is a good approximation of $I_j^r$ as the error decreases exponentially with α. In other words, it establishes that the convergence towards the true rank indicators is exponentially fast.

Theorem 2. Let $S_{\min}$ be the smallest score in S and β the minimal ratio between scores $S_j$ and $S_{j'}$ when $S_j > S_{j'}$: $\beta = \min_{(j,j'),\, S_j > S_{j'}} \frac{S_j}{S_{j'}}$. Furthermore, let $c = \left(\frac{\beta+1}{2}\right)^{\frac{1}{K-1}}$, where K is the rank at which the IR metric is considered, and let $\gamma = \min\left\{\delta,\ 0.5 - \delta,\ (1-\delta)\frac{c-1}{c+1}\right\}$. If:

$$\text{(C1)} \qquad \alpha > \frac{2^{K-1}\left[\log(K-1) - \log\gamma\right]}{S_{\min}\,\min\{1, \frac{\beta-1}{2}\}},$$

then:

$$\forall r \in \{1, \ldots, K\},\ \forall j \in \{1, \ldots, K\}, \quad |I_j^r - I_j^{r,\alpha}| \le \epsilon_\alpha,$$

with $\epsilon_\alpha = (K-1)\, e^{-\frac{\alpha S_{\min}}{2^{K-1}} \min\{1, \frac{\beta-1}{2}\}}$.

Proof. (Sketch) The proof proceeds by induction over r. Let us assume that the property is true up to rank r − 1 with r > 1 (the case r = 1 is direct) and let us prove it is true for rank r. As for Th. 1, let us denote, for any j', $S_{j'} \prod_{l=1}^{r-1} (1 - I_{j'}^{l,\alpha} - \delta)$ by $A_{j'}^{\alpha,r}$, let $j_r$ be the r-th highest document and let $B^r$ be the set of (r − 1) highest documents. Then, one can show that $A_j^{\alpha,r} - A_{j_r}^{\alpha,r}$ is less than:

$$\begin{cases} -S_{\min}\,(1-\delta-\epsilon_\alpha)^{r-2}\left((\delta-\epsilon_\alpha) + \beta(1-\delta-\epsilon_\alpha)\right), & \text{for } j \in B^r, \\ -S_{\min}\,(1-\delta-\epsilon_\alpha)^{r-1}\left(\beta - \left(\frac{1-\delta+\epsilon_\alpha}{1-\delta-\epsilon_\alpha}\right)^{r-1}\right), & \text{for } j \notin B^r,\ j \neq j_r. \end{cases}$$

Condition (C1) on α ensures that $1 - \delta - \epsilon_\alpha > 0.5$, and thus $1 - \delta + \epsilon_\alpha > 0.5$ and $(\delta - \epsilon_\alpha) + \beta(1 - \delta - \epsilon_\alpha) > 0.5$ as β > 1 and $\delta - \epsilon_\alpha > 0$. It also ensures that:

$$\beta - \left(\frac{1-\delta+\epsilon_\alpha}{1-\delta-\epsilon_\alpha}\right)^{r-1} \ge \frac{\beta-1}{2}.$$

Thus, for any $j \neq j_r$: $A_j^{\alpha,r} - A_{j_r}^{\alpha,r} \le -\frac{S_{\min}}{2^{r-1}} \min\{1, \frac{\beta-1}{2}\}$. Factorizing $A_{j_r}^{\alpha,r}$ in the expression of $I_j^{r,\alpha}$ leads to the desired result. The case $j = j_r$ is treated directly with the same factorization.

As one can notice, both the right-hand side of Condition (C1) and $\epsilon_\alpha$ can theoretically be made as small as one wants by increasing α or, equivalently, rescaling the scores of the documents without changing their ranking. This directly derives from the use of the exponential function, which makes the softmax behave as a true rank indicator when the document scores are spread and have high values. Note also that the proof of this theorem requires 0 < δ < 0.5, hence the condition δ ∈ (0, 0.5) mentioned in Th. 1.

The above approximation error directly translates to compositions of linear combinations and Lipschitz functions of the rank indicators, used e.g. in IR metrics as will be further discussed in Section 4.

Corollary 1. For K ∈ {1, ..., N}, let $\mathcal{I} = \{I_j^r\}$ and $\mathcal{I}^\alpha = \{I_j^{r,\alpha}\}$ for 1 ≤ r ≤ K, 1 ≤ j ≤ N. Consider the function h such that $h(\mathcal{I}; a, b) = \sum_{r=1}^{K} a_r\, g\left(\sum_{j} b_j I_j^r\right)$, where g is a Lipschitz function with Lipschitz constant ℓ, and $a = \{a_r\}_{r=1}^K$ and $b = \{b_j\}_{j=1}^N$ are real-valued constants. Then:

$$|h(\mathcal{I}; a, b) - h(\mathcal{I}^\alpha; a, b)| \le \left(\sum_{r=1}^{K} |a_r|\right) \left(\sum_{j=1}^{N} |b_j|\right) \ell\, \epsilon_\alpha.$$
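As a quick numerical illustration of this exponential decay (not an experiment from the paper), the snippet below reuses the hypothetical `smooth_rank_indicators` sketch from Section 3.1 on a toy score vector and compares the worst-case indicator error to the bound $\epsilon_\alpha$ of Theorem 2; the bound is only guaranteed once α satisfies condition (C1).

```python
import math
import torch

# Toy setting: strictly positive, distinct scores and K = 3
scores = torch.tensor([0.5, 2.0, 1.2, 0.9, 3.1])
K, delta = 3, 0.1
true_order = torch.argsort(scores, descending=True)
true_I = torch.zeros(K, scores.shape[0])
true_I[torch.arange(K), true_order[:K]] = 1.0  # exact rank indicators

S_min = scores.min().item()
ratios = [si / sj for si in scores.tolist() for sj in scores.tolist() if si > sj]
beta = min(ratios)  # minimal ratio between scores S_j > S_j'

for alpha in [1, 10, 50, 100, 200]:
    I_alpha = smooth_rank_indicators(scores, K, alpha=alpha, delta=delta)
    err = (true_I - I_alpha).abs().max().item()
    # Upper bound of Theorem 2 (meaningful once alpha satisfies condition (C1))
    eps = (K - 1) * math.exp(-alpha * S_min / 2 ** (K - 1) * min(1.0, (beta - 1) / 2))
    print(f"alpha={alpha:4d}  max error={err:.4f}  eps_alpha={eps:.4f}")
```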
3.3 Gradient Stabilization in Neural Architectures

In pilot experiments, we found that the recursive computation in $I_j^{r,\alpha}$ (Eq. 7) could sometimes lead to numerical instability when computing its gradient with respect to the scores S. We put this on the account of the complexity of the computation graph, which results from the recursion creating multiple paths between the $I_j^{r,\alpha}$ node and any score node $S_{j'}$. To alleviate this issue, we adopted a simple solution which consists in applying the stop-gradient operator to $\prod_{l=1}^{r-1} (1 - I_{j'}^{l,\alpha} - \delta)$ in the definition of $I_j^{r,\alpha}$ to "prune" the computation graph in the backward pass. This operator, which was used in previous works such as [47], acts as the identity function in the forward pass and sets the partial derivatives of its argument to zero in the backward pass, leading to the following slightly modified definition of $I_j^{r,\alpha}$ which we use in practice:

$$I_j^{r,\alpha} = \frac{e^{\alpha S_j\, \mathrm{sg}\left[\prod_{l=1}^{r-1} (1 - I_j^{l,\alpha} - \delta)\right]}}{\sum_{j'} e^{\alpha S_{j'}\, \mathrm{sg}\left[\prod_{l=1}^{r-1} (1 - I_{j'}^{l,\alpha} - \delta)\right]}}, \qquad (9)$$

where sg[·] is the stop-gradient operator. In other words, we consider that the lower-rank smooth indicators $I_{j'}^{l,\alpha}$ (l < r) in $I_j^{r,\alpha}$ are constant with respect to S.

4 Application to IR Metrics

Based on the proposed SmoothI, one can define losses that approximate the standard IR metrics, and that may be easily used to optimize the upstream neural IR model producing the scores S, depicted in Fig. 1. In particular, a simple approximation of P@K_q is obtained by replacing $\mathrm{rel}_q(j_r)$ with $\sum_{j=1}^{N} \mathrm{rel}_q(j) I_j^{r,\alpha}$ in Eq. 1, leading to the following objective function:

$$\mathrm{P@}K_q^{\alpha} = \frac{1}{K} \sum_{r=1}^{K} \sum_{j=1}^{N} \mathrm{rel}_q(j)\, I_j^{r,\alpha}, \qquad (10)$$

from which one obtains the following approximation of AP_q:

$$\mathrm{AP}_q^{\alpha} = \frac{1}{\sum_{j=1}^{N} \mathrm{rel}_q(j)} \sum_{K=1}^{N} \left( \sum_{j=1}^{N} \mathrm{rel}_q(j)\, I_j^{K,\alpha} \right) \mathrm{P@}K_q^{\alpha}. \qquad (11)$$

Similarly, the approximation for NDCG@K_q is given by:

$$\mathrm{NDCG@}K_q^{\alpha} = \frac{1}{N_K^q} \sum_{r=1}^{K} \frac{2^{\sum_{j=1}^{N} \mathrm{rel}_q(j) I_j^{r,\alpha}} - 1}{\log_2(r+1)}. \qquad (12)$$

A direct application of Corollary 1 and averaging over Q queries leads to:

$$|\mathrm{P@}K - \mathrm{P@}K^{\alpha}| \le m\, \epsilon_\alpha,$$

where m is the average number of relevant documents per query:

$$m = \frac{1}{Q} \sum_{q=1}^{Q} \sum_{j=1}^{N} \mathrm{rel}_q(j).$$

For MAP, a direct calculation leads to:

$$|\mathrm{MAP} - \mathrm{MAP}^{\alpha}| \le 2N(\epsilon_\alpha + \epsilon_\alpha^2).$$

Similarly, for NDCG@K, one obtains, using a Taylor expansion of the function $2^x$ around x = 0:

$$|\mathrm{NDCG@}K - \mathrm{NDCG@}K^{\alpha}| \le N \epsilon_\alpha.$$

This shows that the approximations obtained for P@K, MAP and NDCG@K (and a fortiori NDCG) are of exponential quality.
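For illustration, the sketch below turns the smooth indicators into differentiable P@K and NDCG@K objectives following Eqs. 10 and 12. It reuses the hypothetical `smooth_rank_indicators` helper sketched in Section 3.1, takes the normalizer $N_K^q$ to be the ideal DCG@K, and is not the released implementation.

```python
import torch

def smooth_precision_at_k(scores, relevance, K, alpha=1.0, delta=0.1):
    """Differentiable P@K of Eq. 10 for a single query."""
    I = smooth_rank_indicators(scores, K, alpha=alpha, delta=delta)  # (K, N)
    return (I * relevance.unsqueeze(0)).sum() / K

def smooth_ndcg_at_k(scores, relevance, K, alpha=1.0, delta=0.1):
    """Differentiable NDCG@K of Eq. 12 for a single query."""
    I = smooth_rank_indicators(scores, K, alpha=alpha, delta=delta)      # (K, N)
    gains = 2.0 ** (I @ relevance) - 1.0                                 # (K,)
    discounts = torch.log2(torch.arange(2, K + 2, dtype=scores.dtype))   # log2(r+1)
    # Normalizer N_K^q, taken here as the ideal DCG@K from the true relevance ordering
    ideal_gains = 2.0 ** relevance.sort(descending=True).values[:K] - 1.0
    ideal_dcg = (ideal_gains / discounts).sum().clamp_min(1e-8)
    return (gains / discounts).sum() / ideal_dcg

# Listwise training loss for one query: maximize the smooth metric
scores = torch.tensor([0.2, 1.4, 0.8, 2.3], requires_grad=True)
relevance = torch.tensor([0.0, 2.0, 1.0, 0.0])
loss = 1.0 - smooth_ndcg_at_k(scores, relevance, K=3, alpha=10.0)
loss.backward()  # gradients wrt the scores, usable to train the scoring model
```

In a training loop, `scores` would be the output of the upstream neural model for all documents of a query, and the loss would be averaged over the queries of a mini-batch.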
5 Experiments

We conducted both feature-based learning-to-rank and text-based IR experiments to validate SmoothI's ability to define high-quality differentiable approximations of IR metrics, and hence meaningful listwise losses. In particular, our evaluation seeks to address the following research questions:

(RQ1) How do ranking losses based on SmoothI perform for learning-to-rank IR in comparison to (a) state-of-the-art listwise losses and (b) ranking losses derived from recent differentiable sorting methods?
(RQ2) Among the differentiable IR metrics derived from SmoothI, is there any metric that yields superior results?
(RQ3) How efficient is SmoothI in comparison to competing approaches?
(RQ4) Do neural models for text-based IR (e.g., BERT) benefit from SmoothI's listwise loss?

The remainder of this section is organized as follows. Section 5.1 describes the experimental setup of the learning-to-rank experiments. Sections 5.2 and 5.3 provide the results of the learning-to-rank IR experiments, respectively answering (RQ1) and (RQ2). Section 5.4 tackles (RQ3) by comparing the efficiency of the different learning-to-rank approaches. Finally, Section 5.5 details our experiments with BERT on text-based IR, thus addressing (RQ4).

[Figure 2: Neural model used to obtain the scores for all the documents (d_1, ..., d_N) for a query q. The network takes the query-document features x_{q,d_k} and outputs the estimated score s_{q,d_k} for each query-document pair (q, d_k).]

Table 1: Statistics of the learning-to-rank datasets, averaged over 5 folds. YLTR is given only for Set-1.

|        | #queries (train / val / test) | #docs (train / val / test)    |
|--------|-------------------------------|-------------------------------|
| MQ2007 | 1,015 / 338 / 338             | 41,774 / 13,925 / 13,925      |
| MQ2008 | 470 / 157 / 157               | 9,127 / 3,042 / 3,042         |
| WEB30K | 18,919 / 6,306 / 6,306        | 2,262,675 / 754,225 / 754,225 |
| YLTR   | 19,944 / 2,994 / 6,983        | 473,134 / 71,083 / 165,660    |

5.1 Learning-to-Rank Experimental Setup

Datasets. To evaluate our approach, we conducted learning-to-rank experiments on standard, publicly available datasets, namely LETOR 4.0 MQ2007, MQ2008 and MSLR-WEB30K [37] (hereafter denoted as WEB30K), respectively containing 1,692, 784 and 31,531 queries and 69,623, 15,211 and 3,771,125 documents, and the Yahoo learning-to-rank Set-1 dataset [10] (hereafter denoted as YLTR), containing 29,921 queries and 709,877 documents. In these datasets, each query-document pair is associated with a feature vector. We rely on the standard 5-fold split into train, validation and test sets for the LETOR collections and the standard train/validation/test split for YLTR; given that YLTR does not provide a 5-fold cross-validation split, we perform 5 runs on the same split for this dataset. The statistics of the different datasets for their respective folds are detailed in Table 1.

Neural model and compared losses. To rank documents, we used the same fully-connected feedforward neural network for all approaches. It is composed of an input layer followed by batch normalization, a 1024-dimensional hidden layer with ReLU activation, again followed by batch normalization, and an output layer that provides the score of the document given as input. Figure 2 provides a schematic view of this network. This network was trained with different listwise learning-to-rank losses for comparison: ListNET [9], ListMLE [52], ListAP [41], LambdaLoss [49], Approx [38], and SmoothI (our approach). As LambdaLoss, Approx and SmoothI can be used to approximate different IR metrics, we defined eight variants for each approach, respectively optimizing P@{1, 5, 10}, NDCG@{1, 5, 10, N}, and MAP whenever possible. ListNET, ListAP (https://github.com/almazan/deep-image-retrieval), Approx and SmoothI (https://github.com/*****/SmoothI, anonymized for review purposes) were implemented in PyTorch [34]. For ListMLE and LambdaLoss, we relied on the TF-Ranking library [33]. To evaluate the performance of the different approaches, we used the Python implementation of the TREC evaluation tool [48].
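As a point of reference, below is a minimal PyTorch sketch of such a scoring network (input batch normalization, a 1024-dimensional ReLU hidden layer followed by batch normalization, and a scalar output); the class name is ours and only illustrates the architecture described above. Scores can be shifted to be strictly positive before applying SmoothI, as noted in Section 2.

```python
import torch.nn as nn

class ScoringNet(nn.Module):
    """Feedforward scorer: one query-document feature vector in, one score out."""
    def __init__(self, num_features: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(num_features),        # batch norm on the input features
            nn.Linear(num_features, hidden_dim),
            nn.ReLU(),
            nn.BatchNorm1d(hidden_dim),          # batch norm after the hidden layer
            nn.Linear(hidden_dim, 1),            # scalar relevance score
        )

    def forward(self, x):                        # x: (batch, num_features)
        return self.net(x).squeeze(-1)           # (batch,) scores S
```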
In addition to the aforementioned listwise losses originating from the learning-to-rank community, we also considered, for the sake of comprehensiveness, additional losses derived from recent approaches proposed for differentiable sorting and ranking [2, 13, 17, 36], using the same neural network as before. Although not necessarily designed originally for learning to rank, these approaches provide a natural framework for such an application. These works are described in more detail in Section 6. We compared in particular against NeuralSort [17] (https://github.com/ermongroup/neuralsort) and SoftSort [36] (https://github.com/sprillo/softsort), which both propose a continuous relaxation of the sorting operator based on unimodal row-stochastic matrices; OT [13] (https://github.com/google-research/google-research/tree/master/soft_sort), which frames differentiable sorting as an optimal transport problem; and FastSort [2] (https://github.com/google-research/fast-soft-sort), which devised an efficient differentiable approximation based on projections onto the convex hull of permutations. We used the original implementations provided by the authors. For NeuralSort, we used the deterministic version of the approach as it was shown in [17] to perform similarly to the stochastic one. The IR metric optimized for all differentiable sorting approaches is set to NDCG (i.e., NDCG@N); any other IR metric could have been used here as well, but we restrained ourselves to NDCG as we found it led to the best performance (as also observed for SmoothI, see Section 5.3).

Hyperparameters. The mini-batch size is set to 128, and the network parameters are optimized with the Adam optimizer [21], with an initial learning rate in the range {10^-2, 10^-3}. Each model is trained for 50 epochs and the parameters (weights) leading to the lowest validation error are selected. The hyperparameters α and β for Approx are searched over {0.1, 1, 10, 10^2}. Similarly, for SmoothI, a line search was performed on the hyperparameter α in the range {0.1, 1, 10, 10^2}. The hyperparameter δ was simply set to 0.1 as we found in pilot experiments that this value consistently gave the best results on the validation set. This is illustrated in Fig. 3, which shows the NDCG performance on the validation set of MQ2007. We can observe that no matter what the choice of α is, setting δ = 0.1 leads to the best results (or very close to the best for α = 0.1). For NeuralSort and SoftSort, the temperature is searched over {0.1, 1, 10, 10^2}. The regularization strength of FastSort and OT is searched over {10^-2, 0.1, 1, 10}.

[Figure 3: NDCG performance (averaged over 5 folds) of SmoothI on MQ2007's validation set with different α and δ.]

[Figure 4: Learning-to-rank retrieval results of SmoothI variants (optimizing different IR metrics) on MQ2007 and MQ2008. Two panels (MQ2007 and MQ2008); y-axis: performance; x-axis: P@1, P@5, P@10, NDCG@1, NDCG@5, NDCG@10, NDCG, MAP; one curve per SmoothI variant (SmoothI-P@{1,5,10}, SmoothI-NDCG@{1,5,10}, SmoothI-NDCG).]

5.2 Comparison to Learning-to-Rank IR Baselines (RQ1)

In order to tackle (RQ1), we study in this section the retrieval performance of the different learning-to-rank losses derived from SmoothI and from the baseline approaches. Table 2 presents the learning-to-rank results, averaged over 5 folds for MQ2007, MQ2008 and Web30K, each fold using a different random initialization, and averaged over five random initializations for YLTR. We report significance using a Student t-test with Bonferroni correction at the 5% significance level.
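For completeness, the snippet below sketches this kind of significance test with SciPy: a paired t-test between the best-performing method and a competitor over per-fold scores, with the 5% threshold divided by the number of comparisons (Bonferroni correction). The score arrays and the number of comparisons are purely illustrative, not values from Table 2.

```python
import numpy as np
from scipy import stats

# Illustrative per-fold NDCG scores (5 folds) for the best method and one competitor
best = np.array([0.612, 0.609, 0.615, 0.611, 0.613])
other = np.array([0.603, 0.601, 0.607, 0.602, 0.604])

num_comparisons = 9                    # e.g., one test per competing baseline
corrected_level = 0.05 / num_comparisons  # Bonferroni-corrected threshold

t_stat, p_value = stats.ttest_rel(best, other)  # paired t-test across folds
significantly_worse = p_value < corrected_level and best.mean() > other.mean()
print(f"p={p_value:.4f}, competitor significantly worse: {significantly_worse}")
```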
For space reasons, we only show in this table the best results obtained across the variants of each approach – each variant optimizing an IR metric among P@{1, 5, 10}, NDCG@{1, 5, 10, N }, and MAP. As one can notice, SmoothI is the best performing method on MQ2007, MQ2008 and Web30K, even though the difference on MQ2007 and MQ2008 was often not found to be significant, in particular with respect to Approx. On Web30K, SmoothI significantly outperforms all methods on P@1, NDCG@1, NDCG@5 and NDCG@10, and all methods but Approx on P@5 and NDCG. Approx however obtained significantly better performance on MAP. The results are more contrasted on YLTR. On the one hand, ListMLE and SmoothI are on par according to the NDCG-based metrics as they respectively obtained the best performance in terms of NDCG@{1, 5} and NDCG@{10, N }, with significant differences only at cutoff 1 and 10. On the other hand, in terms of precision-based metrics and MAP, ListMLE outperformed all other approaches except Approx. Turning to the listwise losses obtained from the differentiable sorting approaches (FastSort, OT, NeuralSort and SoftSort), we observe that these methods demonstrate competitive performance on the learning to rank task, with the exception of NeuralSort. SmoothI nonetheless outperformed all of these approaches, in particular with significant differences on Web30K for all metrics. In summary, over all the collections, we can conclude that SmoothI proves to be very competitive on learning to rank with respect to traditional listwise losses and differentiable sorting approaches. 5.3 Impact of the Optimized IR Metric (RQ2) In the previous section, we observed that the SmoothI variants overall led to very competitive results compared to the considered baselines. We now turn to investigating (RQ2) by studying how SmoothI variants perform individually. For this purpose, we plot in Fig. 4 the learning-to-rank results on MQ2007 and MQ2008 for the most important SmoothI variants13 – respectively optimizing P@{1, 5, 10} or NDCG@{1, 5, 10, N } – to show their performance with respect to every metric. Strikingly, we observe that the variant SmoothI-NDCG, which optimizes NDCG, yields the best performance, no matter which evaluation measure is considered – precisionbased or NDCG-based, and at any cutoff. Having such consistency is particularly appealing because it means that one can in general simply use SmoothI-NDCG in order to get all around good performance according to any IR metric. Looking at the other results, we can notice that metrics with higher cutoff (e.g., @10 and @N) perform the best. This could be explained by the fact that a higher cutoff leads to a wider coverage of the set of scores output by the neural model and, consequently, better gradient updates. This also confirms the ability of SmoothI to properly order the documents associated to a query, even for higher ranks. 
10 P@1 P@5 P@10 NDCG@1 NDCG@5 NDCG@10 NDCG MAP MQ2007 ListNet [9] ListMLE [52] ListAP [41] LambdaLoss [49] Approx [38] FastSort [2] OT [13] NeuralSort [17] SoftSort [36] SmoothI (ours) 0.463±0.008 0.442±0.017† 0.457±0.006† 0.452±0.011† 0.479±0.015 0.461±0.010 0.451±0.014† 0.373±0.083† 0.469±0.008 0.488±0.007 0.412±0.010† 0.397±0.011† 0.405±0.010† 0.403±0.007† 0.419±0.008 0.405±0.007† 0.405±0.009† 0.326±0.073† 0.413±0.005† 0.424±0.006 0.371±0.008† 0.366±0.007† 0.369±0.007† 0.372±0.006† 0.384±0.006 0.367±0.004† 0.375±0.006† 0.299±0.067† 0.378±0.007 0.384±0.006 0.420±0.010 0.395±0.019† 0.405±0.008† 0.407±0.013† 0.430±0.017 0.413±0.012† 0.406±0.014† 0.334±0.075† 0.425±0.010 0.441±0.010 0.422±0.011† 0.405±0.016† 0.414±0.009† 0.415±0.010† 0.430±0.011 0.417±0.008† 0.414±0.012† 0.338±0.076† 0.426±0.007† 0.439±0.010 0.442±0.010† 0.431±0.014† 0.438±0.008† 0.440±0.009† 0.455±0.010 0.438±0.007† 0.441±0.011† 0.357±0.080† 0.452±0.008† 0.461±0.008 0.603±0.007† 0.594±0.009† 0.600±0.005† 0.601±0.007† 0.611±0.007 0.599±0.007† 0.602±0.007† 0.547±0.054† 0.608±0.006 0.612±0.007 0.449±0.009 0.439±0.013 0.449±0.007 0.449±0.008 0.467±0.007 0.445±0.008 0.457±0.008 0.386±0.066† 0.457±0.007 0.461±0.007 MQ2008 ListNet [9] ListMLE [52] ListAP [41] LambdaLoss [49] Approx [38] FastSort [2] OT [13] NeuralSort [17] SoftSort [36] SmoothI (ours) 0.392±0.021† 0.415±0.018† 0.420±0.014 0.441±0.022 0.457±0.023 0.430±0.012 0.431±0.014 0.350±0.022† 0.411±0.016† 0.459±0.021 0.318±0.016† 0.337±0.017† 0.330±0.016† 0.337±0.012† 0.349±0.013 0.332±0.014† 0.342±0.014† 0.293±0.014† 0.335±0.013† 0.353±0.015 0.231±0.010† 0.240±0.014† 0.240±0.012† 0.242±0.010† 0.247±0.012 0.239±0.011† 0.245±0.011 0.228±0.010† 0.245±0.010 0.249±0.011 0.339±0.017† 0.365±0.018† 0.371±0.013 0.385±0.017 0.401±0.020 0.371±0.010 0.382±0.012 0.304±0.016† 0.360±0.016† 0.402±0.018 0.422±0.019† 0.445±0.019† 0.442±0.019† 0.457±0.016† 0.471±0.017 0.450±0.014† 0.461±0.016† 0.387±0.018† 0.449±0.014† 0.477±0.019 0.468±0.019† 0.486±0.024† 0.489±0.020† 0.500±0.016† 0.513±0.020 0.496±0.016† 0.504±0.017 0.448±0.017† 0.497±0.015† 0.514±0.019 0.514±0.018† 0.526±0.021† 0.532±0.018† 0.540±0.017 0.549±0.019 0.537±0.015† 0.542±0.017 0.497±0.017† 0.534±0.016† 0.550±0.019 0.428±0.016† 0.448±0.021† 0.455±0.017† 0.467±0.014 0.478±0.018 0.461±0.012† 0.470±0.014 0.402±0.014† 0.460±0.013† 0.481±0.018 Web30K ListNet [9] ListMLE [52] ListAP [41] LambdaLoss [49] Approx [38] FastSort [2] OT [13] NeuralSort [17] SoftSort [36] SmoothI (ours) 0.694±0.004† 0.620±0.085† 0.715±0.002† 0.697±0.022† 0.767±0.003† 0.722±0.004† 0.682±0.009† 0.405±0.137† 0.724±0.002† 0.776±0.002 0.649±0.003† 0.544±0.078† 0.658±0.001† 0.617±0.033† 0.716±0.002 0.660±0.003† 0.637±0.003† 0.347±0.128† 0.669±0.000† 0.717±0.001 0.622±0.003† 0.498±0.072† 0.618±0.001† 0.565±0.040† 0.675±0.002 0.624±0.002† 0.609±0.001† 0.313±0.122† 0.635±0.000† 0.674±0.001 0.496±0.004† 0.404±0.064† 0.503±0.002† 0.497±0.015† 0.544±0.003† 0.525±0.002† 0.459±0.007† 0.286±0.096† 0.521±0.001† 0.552±0.002 0.483±0.003† 0.383±0.061† 0.483±0.002† 0.466±0.023† 0.523±0.002† 0.494±0.001† 0.456±0.003† 0.259±0.093† 0.500±0.001† 0.530±0.002 0.495±0.003† 0.376±0.060† 0.486±0.001† 0.457±0.029† 0.527±0.001† 0.498±0.002† 0.469±0.002† 0.253±0.095† 0.506±0.001† 0.532±0.001 0.741±0.002† 0.646±0.033† 0.733±0.000† 0.691±0.020† 0.754±0.001 0.738±0.001† 0.729±0.002† 0.534±0.077† 0.747±0.001† 0.754±0.001 0.593±0.002† 0.456±0.044† 0.578±0.001† 0.499±0.035† 0.624±0.001 0.585±0.002† 0.584±0.001† 0.289±0.110† 0.602±0.000† 0.616±0.002† YLTR Table 2: Leaning-to-rank retrieval 
results. Mean test performance ± standard error is calculated over 5 folds for MQ2007, MQ2008 and Web30K, and 5 random initializations for YLTR (as no predefined folds are available). Best results are in bold and “ † ” indicates a model significantly worse than the best one according to a paired t-test with Bonferroni correction at 5%. ListNet [9] ListMLE [52] ListAP [41] LambdaLoss [49] Approx [38] FastSort [2] OT [13] NeuralSort [17] SoftSort [36] SmoothI (ours) 0.858±0.001† 0.874±0.001 0.820±0.005† 0.868±0.002† 0.870±0.001 0.857±0.001† 0.846±0.011† 0.000±0.000† 0.861±0.000† 0.869±0.001† 0.814±0.001† 0.829±0.001 0.768±0.010† 0.822±0.002† 0.828±0.000 0.812±0.000† 0.793±0.019† 0.000±0.000† 0.814±0.000† 0.826±0.000† 0.744±0.000† 0.755±0.000 0.694±0.012† 0.747±0.003† 0.754±0.000 0.741±0.000† 0.719±0.021† 0.000±0.000† 0.741±0.000† 0.750±0.000† 0.726±0.000† 0.724±0.002† 0.685±0.004† 0.731±0.002† 0.731±0.001† 0.724±0.000† 0.710±0.010† 0.000±0.000† 0.729±0.000† 0.735±0.001 0.741±0.001† 0.746±0.002 0.686±0.010† 0.743±0.001† 0.745±0.000† 0.729±0.000† 0.719±0.018† 0.000±0.000† 0.739±0.000† 0.748±0.000 0.777±0.000† 0.783±0.001 0.709±0.014† 0.775±0.002† 0.779±0.000† 0.765±0.000† 0.747±0.024† 0.000±0.000† 0.771±0.000† 0.780±0.000† 0.857±0.000† 0.859±0.001 0.820±0.007† 0.854±0.002† 0.858±0.000 0.851±0.000† 0.842±0.012† 0.499±0.000† 0.854±0.000† 0.858±0.000 0.846±0.001† 0.858±0.000 0.782±0.014† 0.845±0.001† 0.857±0.000† 0.842±0.000† 0.820±0.022† 0.356±0.000† 0.841±0.000† 0.850±0.001† 5.4 Efficiency (RQ3) To address (RQ3), we display in Table 3 the runtime (in seconds) during training for one epoch with the different approaches and their variants. We observe that on MQ2007 and MQ2008 the runtime of most approaches is similar, with the exception of ListMLE and LambdaLoss, which are slower. On Web30k, the order of magnitude of most runtimes is comparable, with nonetheless the OT approach taking much more time and SmoothI-NDCG being slightly slower. On YLTR, ListMLE, LambdaLoss and OT are significantly less efficient. Overall, all approaches other than ListMLE, LambdaLoss and OT seem to be reasonably scalable. Among the SmoothI variants, we note that SmoothI-NDCG is slightly slower in general. Although we previously found that SmoothI-NDCG was the approach that gave the best performance (see Section 5.3), a good trade-off between efficiency and effectiveness could be achieved by considering NDCG at a higher cutoff (e.g., 20 or 50) instead of relying on NDCG@N . 13 We omit MAP here as this measure tends to be less used in IR [16]. 11 Table 3: Training runtime with different learning-to-rank losses for one epoch (in seconds). 
ListNET ListMLE ListAP LambdaLoss-P@1 LambdaLoss-P@10 LambdaLoss-NDCG@1 LambdaLoss-NDCG@10 LambdaLoss-NDCG Approx-P@1 Approx-P@10 Approx-NDCG@1 Approx-NDCG@10 Approx-NDCG FastSort-NDCG OT-NDCG NeuralSort-NDCG SoftSort-NDCG SmoothI-P@1 SmoothI-P@10 SmoothI-NDCG@1 SmoothI-NDCG@10 SmoothI-NDCG 5.5 MQ2007 MQ2008 Web30K YLTR 1.20 12.94 1.31 13.80 13.89 14.56 14.25 14.20 1.21 1.28 1.33 1.26 1.25 3.51 2.97 1.92 1.39 1.27 1.28 1.19 1.27 2.56 0.56 12.50 0.72 13.08 13.17 13.76 13.42 13.24 0.67 0.68 0.64 0.72 0.65 1.62 1.56 0.75 0.73 0.65 0.72 0.69 0.70 1.12 106.45 45.13 108.23 80.01 80.14 87.17 88.22 87.84 125.66 174.33 117.10 164.02 144.17 151.18 1990.37 140.52 118.59 131.58 138.41 139.54 114.52 224.70 33.98 474.29 34.51 850.40 810.83 614.98 877.28 612.50 38.58 41.24 38.08 38.79 40.26 44.20 545.23 32.09 31.78 40.72 40.32 41.82 38.81 57.94 Experiments on Text-based IR (RQ4) To further validate the efficacy of our proposed approach SmoothI and investigate (RQ4), we conducted experiments on text-based information retrieval, i.e., with raw texts as input. In particular, the task consists here in optimizing a given neural model to appropriately rank the documents for each query, where the documents and queries are raw texts. This differs from the previous sections which focus on feature-based learning to rank, i.e., where each query-document pair is represented by a feature vector. Experimental setup. TREC Robust04 is used here as the text-based IR collection, which consists of 250 queries and 0.5M documents. We use the keyword version of queries, corresponding to the title fields of TREC topics [14, 31]. We experimented with vanilla BERT [15] as the neural ranking model, as it is the core of recent state-of-the-art IR methods [14, 29, 26]. To the best of our knowledge, most text-based IR neural models are trained with a pointwise or pairwise loss [26, 29]. A challenge in this experiment was then to use a listwise loss on a BERT model. Indeed, the calculation of the loss requires that the representations of all the documents to be ranked for a query hold together in memory. Given that the whole list of documents associated to a query can be large and lead to a prohibitive memory cost from the BERT model, we adopted a simple alternative. We compute the listwise loss only on the documents of the training batch, where each batch contains two pairs of (relevant, non-relevant) documents associated to one query. We use the pretrained uncased BERT-base as our BERT ranking model and compare the pairwise hinge loss [29] against the proposed SmoothI-based NDCG loss, which proved effective in learning-to-rank experiments (see Section 5.3). We concatenate the [CLS] token, query tokens, the [SEP] token and document tokens (from one document) as BERT’s input tokens. From BERT’s output [CLS] vector, a dense layer generates the relevance score for the corresponding query-document pair. Following previous works [29], to handle documents longer than the capacity of BERT, documents are truncated to 800 tokens. The models for both the pairwise and SmoothI losses are trained for 100 epochs using the Adam optimizer with a learning rate of 2 · 10−5 for BERT and 10−3 for the top dense layer. Batch size is 4 (two pairs, to fit on a single GPU) and gradient accumulation (every 8 steps) is used. We followed a five-fold cross validation protocol [29]. 
The models are trained on the training set (corresponding to three folds), tuned on the validation set (one fold) with early stopping, and evaluated on the test set (the remaining fold). We use the standard re-ranking setting and 12 re-rank the top-150 documents returned by BM25 [42]. The hyperparameters α and δ for SmoothI are set to 1.0 and 0.1 respectively. Table 4: Text-based retrieval results on Robust04. Mean test performance ± standard error is calculated over 5 folds. The best results are in bold and “ † ” indicates a model significantly worse than the best one according to a paired t-test at 5%. P@1 P@5 P@10 P@20 MAP BERT (pairwise loss) BERT (SmoothI loss) BERT (pairwise loss) BERT (SmoothI loss) 0.625±0.035 0.643±0.027 0.533±0.025 0.550±0.030 0.466±0.018 0.486±0.022 0.384±0.011† 0.410±0.018 0.232±0.003† 0.241±0.006 NDCG@1 NDCG@5 NDCG@10 NDCG@20 NDCG 0.581±0.032 0.598±0.025 0.524±0.022 0.535±0.023 0.489±0.016† 0.504±0.019 0.457±0.012† 0.475±0.018 0.444±0.008 0.447±0.007 Results. Table 4 reports the text-based retrieval performance, averaged over 5 folds, of the vanilla BERT model with both pairwise hinge and SmoothI-based NDCG losses. The best results are in bold and “ † ” indicates a model significantly worse than the best one according to a paired t-test at 5%. One can observe that the BERT model performs better when it is trained with the SmoothI loss. The improvement over the pairwise loss is in particular significant on P@20, MAP, NDCG@10 and NDCG@20. To be specific, the vanilla BERT model with SmoothI achieves 0.410 on P@20 and 0.475 on NDCG@20, which are the best results this model has achieved to our knowledge [29, 28]. 6 Related Work The methods discussed in this section and used as baselines in our experiments are presented in bold. Listwise approaches are widely used in IR as they directly address the ranking problem [9, 52]. A first category of methods developed for listwise learning to rank aimed at building surrogates for non-differentiable loss functions based on a ranking of the objects. In this line, RankCosine [39] used a loss function based on the cosine of two rank vectors while ListNet [9] adopted a cross-entropy loss. ListMLE and its extensions [25, 52] introduced a likelihood loss and a theoretical framework for statistical consistency (extended in [24, 23, 51]), while [20] and [3, 40, 46] studied surrogate loss functions for P@K and NDCG, respectively. Lastly, LambdaRank [5] used a logistic loss weighted by the cost, according to the targeted evaluation metric, of swapping two documents. This approach has then been extended to tree-based ensemble methods in LambdaMART [6], and finally generalized in LambdaLoss [49], the best performing method according to [49] in this family. If surrogate losses are interesting as they can lead to simpler optimization problems, they are sometimes only loosely related to the target loss, as pointed out in [4]. A typical example is the Top-K loss proposed in [1] (see also [12, 53] for a study of the relations between evaluation metrics and surrogate losses). Furthermore, using a notion of consistency based on the concept of calibration developed in [44], Calauzènes et al. [8, 7] have shown that convex and consistent surrogate ranking losses do not always exist, as for example for the mean average precision or the expected reciprocal rank. 
Researchers have thus directly studied differentiable approximations of loss functions and evaluation metrics – from SoftRank [45], which proposed a smooth approximation to NDCG, to the recent differentiable approximation of MAP, called ListAP, in the context of image retrieval [41]. Some of the proposed approaches are based on a soft approximation of the position function [50] or of the rank indicator [11], from which one can derive differentiable approximations of most standard IR metrics. However, [50] is specific to DCG whereas [11] assumes that the inverse of the rank function is known. Closer to our proposal is the work of [38] (referred to as Approx in our experiments) which was recently used in [4] and which makes use of the composition of two approximation functions, namely the position and the truncation functions, to obtain theoretically sound differentiable approximations of P@K, MAP, P@K and NDCG@K. In contrast, our approach makes use of a single approximation, that of the rank indicator, for all losses and metrics considered, and thus avoids in general composing the errors of different approximations.14 14 Note however that as MAP is based on a composition of rank indicators, the errors of each approximation also compose. 13 More recently, different studies, mostly in the machine learning community, have been dedicated to differentiable approximations of the sorting and the rank indicators. A fundamental relation between optimal transport and generalized sorting is for example provided in [13], with an approximation based on Sinkhorn quantiles. This is the approach referred to as OT in our experiments (note that [54] also exploits optimal transport for listwise document ranking, without however proving that the approximation used is correct). [2] have focused on devising fast approximations of the sorting and ranking functions by casting differentiable sorting and ranking as projections onto the the convex hull of all permutations, an approach referred to as FastSort in our experiments. Closer to our proposal – as they are also considering rank indicators – are the studies presented in [35, 17, 36], mostly for K-NN classification. [35] propose a recursive formulation of an approximation of the rank indicator that bears similarities with ours. However, no theoretical guarantees are provided, neither for this approximation nor for the K-NN loss it is used in. A more general framework, based on unimodal row-stochastic matrices, is used in [17], in which an approximation of the sorting operator, referred to as NeuralSort in our experiments, is introduced. It can be shown that the N × N matrix Iα = {Ijr,α }1≤r≤N,1≤j≤N is a unimodal row-stochastic matrix, so that our proposal can be used in their framework as well. [36] further improved the above proposal by simplifying it, an approach referred to as SoftSort. Lastly, we want to mention the approach developed by [22] who propose an adaptive projection method, called Rankmax, that projects, in a differentiable manner, a score vector onto the (n, k)-simplex. This method is particularly well adapted to multi-class classification. Its application to IR mettrics remains however to be studied. 7 Conclusion We presented in this study a unified approach to build differentiable approximations of IR metrics (P@K, MAP and NDCG@K) on the basis of an approximation of the rank indicator function. 
We further showed that the errors associated with these approximations decrease exponentially with an inverse temperaturelike hyperparameter that controls the quality of the approximations. We also illustrated the efficacy and efficiency of our approach on four standard collections based on learning-to-rank features, as well as on the popular TREC Robust04 text-based collection. All in all, our proposal, referred to as SmoothI, constitutes an additional tool for differentiable ranking that proved highly competitive compared with previous approaches on several collections, either based on learning to rank or textual features. We also want to stress that the approach we proposed is nevertheless more general and can directly be applied to other losses, such as the K-NN loss studied in [17], and functions that are directly based on the rank indicator. Among such functions, we are particularly interested in the ranking function, which aims at ordering the documents in decreasing order of their scores, the sorting function, which aims at ordering the scores, and the position function, which aims at providing, for each document, its rank in the ordered list of scores. We plan to study, on the basis of the development given in this paper, differentiable approximations of these functions in a near future. References [1] Leonard Berrada, Andrew Zisserman, and M. Pawan Kumar. Smooth loss functions for deep top-k classification. In Proceedings of the 6th International Conference on Learning Representations, 2018. [2] Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast Differentiable Sorting and Ranking. In Proceedings of the 37th International Conference on Machine Learning, pages 950–959, 2020. [3] Sebastian Bruch. An alternative cross entropy loss for learning-to-rank. arXiv:1911.09798, 2019. [4] Sebastian Bruch, Masrour Zoghi, Michael Bendersky, and Marc Najork. Revisiting approximate metric optimization in the age of deep neural networks. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019. [5] Christopher J. C. Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, 2007. 14 [6] Christopher J. C. Burges, Krysta M. Svore, Paul N. Bennett, Andrzej Pastusiak, and Qiang Wu. Learning to rank using an ensemble of lambda-gradient models. Journal of Machine Learning Research, 2011. [7] Clément Calauzènes and Nicolas Usunier. On ranking via sorting by estimated expected utility. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, 2020. [8] Clément Calauzènes, Nicolas Usunier, and Patrick Gallinari. On the (non-)existence of convex, calibrated surrogate losses for ranking. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, pages 197–205, 2012. [9] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007. [10] Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. In Proceedings of the 2010 International Conference on Yahoo! Learning to Rank Challenge, page 1–24, 2010. [11] Olivier Chapelle and Mingrui Wu. Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval, 13(3):216–235, 2010. 
[12] Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhiming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 315–323, 2009. [13] Marco Cuturi, Olivier Teboul, and Jean-Philippe Vert. Differentiable ranking and sorting using optimal transport. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, 2019. [14] Zhuyun Dai and Jamie Callan. Deeper text understanding for ir with contextual neural language modeling. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 985–988, 2019. [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics), pages 4171–4186, 2019. [16] Norbert Fuhr. Some common mistakes in ir evaluation, and how they can be avoided. SIGIR Forum, 51(3), 2017. [17] Aditya Grover, Eric Wang, Aaron Zweig, and Stefano Ermon. Stochastic Optimization of Sorting Networks via Continuous Relaxations. In Proceedings of the 7th International Conference on Learning Representations, 2019. [18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations, 2017. [19] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 2002. [20] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Surrogate functions for maximizing precision at the top. In Proceedings of the 32nd International Conference on Machine Learning, pages 189–198, 2015. [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015. [22] Weiwei Kong, Walid Krichene, Nicolas Mayoraz, Steffen Rendle, and Li Zhang. Rankmax: An adaptive projection alternative to the softmax function. In Proceedings of the 34th Annual Conference on Neural Information Processing Systems, 2020. 15 [23] Yanyan Lan, Jiafeng Guo, Xueqi Cheng, and Tie-Yan Liu. Statistical consistency of ranking methods in a rank-differentiable probability space. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, pages 1241–1249, 2012. [24] Yanyan Lan, Tie-Yan Liu, Zhiming Ma, and Hang Li. Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 577–584, 2009. [25] Yanyan Lan, Yadong Zhu, Jiafeng Guo, Shuzi Niu, and Xueqi Cheng. Position-aware ListMLE: A sequential learning process for ranking. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pages 449–458, 2014. [26] Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, and Yingfei Sun. Parade: Passage representation aggregation for document reranking. arXiv:2008.09093, 2020. [27] Tie-Yan Liu. Learning to rank for information retrieval. Springer, 2011. [28] Xinyu Ma, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Xiang Ji, and Xueqi Cheng. Prop: Pre-training with representative words prediction for ad-hoc retrieval. arXiv:2010.10137, 2020. [29] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. Cedr: Contextualized embeddings for document ranking. 
In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1101–1104, 2019. [30] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 5th International Conference on Learning Representations, 2017. [31] Ryan McDonald, George Brokos, and Ion Androutsopoulos. Deep relevance ranking using enhanced document-query interactions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1849–1860, 2018. [32] Maziar Moradi Fard, Thibaut Thonet, and Eric Gaussier. Deep k-means: Jointly clustering with k-means and learning representations. arXiv:1806.10069, 2018. [33] Rama Kumar Pasumarthi, Sebastian Bruch, Xuanhui Wang, Cheng Li, Michael Bendersky, Marc Najork, Jan Pfeifer, Nadav Golbandi, Rohan Anil, and Stephan Wolf. Tf-ranking: Scalable tensorflow library for learning-to-rank. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 2970–2978, 2019. [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 8024–8035. 2019. [35] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems, 2018. [36] Sebastian Prillo and Julian Eisenschlos. Softsort: A continuous relaxation for the argsort operator. In Proceedings of the 37th International Conference on Machine Learning, pages 7793–7802, 2020. [37] Tao Qin and Tie-Yan Liu. Introducing LETOR 4.0 datasets. arXiv:1306.2597, 2013. [38] Tao Qin, Tie-Yan Liu, and Hang Li. A General Approximation Framework for Direct Optimization of Information Retrieval Measures. Information Retrieval, 13(4), 2010. [39] Tao Qin, Xu-Dong Zhang, Ming-Feng Tsai, De-Sheng Wang, Tie-Yan Liu, and Hang Li. Query-level loss functions for information retrieval. Information Processing & Management, 44(2):838–855, 2008. 16 [40] Pradeep Ravikumar, Ambuj Tewari, and Eunho Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 618–626, 2011. [41] Jérôme Revaud, Jon Almazán, Rafael S. Rezende, and César Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, pages 5106–5115, 2019. [42] Stephen E Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241, 1994. [43] Kenneth Rose, Eitan Gurewitz, and Geoffrey Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11(9):589–594, 1990. [44] Ian Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007. [45] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. 
SoftRank: Optimizing non-smooth rank metrics. In Proceedings of the 1st International Conference on Web Search and Data Mining, page 77–86, 2008. [46] Hamed Valizadegan, Rong Jin, Ruofei Zhang, and Jianchang Mao. Learning to rank by optimizing NDCG measure. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, 2009. [47] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 6309–6318, 2017. [48] Christophe Van Gysel and Maarten de Rijke. Pytrec eval: An extremely fast python interface to trec eval. In Proceedings of the 41st International ACM SIGIR conference on Research and Development in Information Retrieval, 2018. [49] Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. The LambdaLoss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018. [50] Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, page 1923–1926, 2009. [51] Fen Xia, Tie-Yan Liu, and Hang Li. Statistical consistency of top-k ranking. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 2098–2106, 2009. [52] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, 2008. [53] Jun Xu, Tie-Yan Liu, Min Lu, Hang Li, and Wei-Ying Ma. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 107–114, 2008. [54] Haitao Yu, Adam Jatowt, Hideo Joho, Joemon M. Jose, Xiao Yang, and Long Chen. WassRank: Listwise document ranking using optimal transport theory. In Proceedings of the 12th ACM International Conference on Web Search and Data Mining, pages 24–32, 2019. 17