We propose the Assessor-driven Weighted Averages for Retrieval Evaluation (AWARE) probabilistic framework, a novel methodology for dealing with multiple crowd assessors, who may be contradictory and/or noisy. By modeling relevance judgements and crowd assessors as sources of uncertainty, AWARE takes the expectation of a generic performance measure, like Average Precision (AP), composed with these random variables. In this way, it approaches the problem of aggregating different crowd assessors from a new perspective, i.e. directly combining the performance measures computed on the ground truths generated by the crowd assessors instead of adopting some classification technique to merge the labels they produce. We propose several unsupervised estimators that instantiate the AWARE framework and we compare them with state-of-the-art approaches, i.e. Majority Vote (MV) and Expectation Maximization (EM), on TREC collections. We found that the AWARE approaches outperform these baselines in their capability of correctly ranking systems and predicting their actual performance scores.
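As an illustration of the core idea, the sketch below (in Python; the function names, uniform weights, and toy data are assumptions, not the paper's implementation) computes AP against each assessor's judgments separately and then takes a weighted average of the resulting scores, rather than merging the labels first.

```python
import numpy as np

def average_precision(run, qrels):
    """AP of a ranked list `run` against one assessor's binary judgments `qrels`."""
    hits, precisions = 0, []
    for rank, doc in enumerate(run, start=1):
        if qrels.get(doc, 0) > 0:
            hits += 1
            precisions.append(hits / rank)
    total_relevant = sum(1 for r in qrels.values() if r > 0)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def aware_score(run, assessor_qrels, weights=None):
    """AWARE-style aggregation: a weighted average of the measure computed on each
    assessor's ground truth (uniform weights by default; illustrative only)."""
    scores = np.array([average_precision(run, q) for q in assessor_qrels])
    if weights is None:
        weights = np.full(len(assessor_qrels), 1.0 / len(assessor_qrels))
    return float(np.dot(weights, scores))

# Example: one ranked list judged by three (possibly contradictory) crowd assessors
run = ["d1", "d2", "d3", "d4"]
assessors = [{"d1": 1, "d3": 1}, {"d1": 1, "d2": 1}, {"d4": 1}]
print(aware_score(run, assessors))
```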
The Web has created a global marketplace for e-Commerce as well as for talent. Online employment marketplaces provide an effective channel to facilitate the matching between job seekers and hirers. This paper presents an initial exploration of user behavior in job and talent search using query and click logs from a popular employment marketplace. The observations suggest that the understanding of users’ search behavior in this scenario is still in its infancy and that some of the assumptions made in general web search may not hold true. The open challenges identified so far are presented.
Ranking query results effectively by considering users’ past behaviour and preferences is a primary concern for IR researchers both in academia and industry. In this context, Learning to Rank (LtR) is widely believed to be the most effective solution to design ranking models that account for user-interaction features, which have proved to remarkably impact IR effectiveness. In this paper, we explore the possibility of integrating the user dynamic directly into LtR algorithms. Specifically, we model with Markov chains the behaviour of users in scanning a ranked result list, and we modify LambdaMart, a state-of-the-art LtR algorithm, to exploit a new discount loss function calibrated on the proposed Markovian model of the user dynamic. We evaluate the performance of the proposed approach on publicly available LtR datasets, finding that the improvements measured over the standard algorithm are statistically significant.
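The sketch below illustrates the general idea under simplified assumptions (the transition probabilities, the absorbing-chain structure, and the DCG-style plug-in point are illustrative, not the actual LambdaMart modification): the expected number of visits to each rank position, derived from a Markov chain over the result list, plays the role of the discount in place of the standard logarithmic one.

```python
import numpy as np

def markov_discount(k, p_next=0.8, p_back=0.05):
    """Expected number of visits to each of the k result positions for a user who
    moves one position down with prob p_next, one up with prob p_back, and otherwise
    stops; computed from the fundamental matrix N = (I - Q)^-1 of the absorbing chain."""
    Q = np.zeros((k, k))
    for i in range(k):
        if i + 1 < k:
            Q[i, i + 1] = p_next
        if i - 1 >= 0:
            Q[i, i - 1] = p_back
    N = np.linalg.inv(np.eye(k) - Q)
    return N[0]  # expected visits when the scan starts at rank 1

def markov_dcg(relevance, discount):
    """DCG-style gain using the Markov-chain-based discount instead of 1/log2(rank+1)."""
    gains = (2.0 ** np.asarray(relevance, dtype=float)) - 1.0
    return float(np.dot(gains, discount[: len(gains)]))

disc = markov_discount(k=10)
print(markov_dcg([3, 2, 0, 1, 0, 0, 2, 0, 0, 1], disc))
```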
In this paper we explore the possibility of integrating the user dynamic directly into LambdaMart by modeling the user behaviour with Markov chains and by defining a new discount loss function calibrated on the proposed model. This approach achieves significantly better performance than standard algorithms.
In this paper, we describe a set of experiments that turn the machine learning classification task into a game, through gamification techniques, and let non-expert users perform text classification without even knowing the underlying problem. The application is implemented in R using the Shiny package for interactive graphics. We present the outcome of three different experiments: a pilot experiment with PhD and post-doc students, and two experiments carried out with primary and secondary school students. The results show that the human-aided classifier performs similarly to, and sometimes even better than, state-of-the-art classifiers.
To address the challenge of adapting Information Retrieval (IR) to constantly evolving user tasks and needs, and to adjust it to user interactions and preferences, we develop a new model of user behavior based on Markov chains. We aim at integrating the proposed model into several aspects of IR, i.e. evaluation measures, systems and collections. First, we study IR evaluation measures and propose a theoretical framework to describe their properties. Then, we present a new family of evaluation measures, called Markov Precision (MP), based on the proposed model and able to explicitly link lab-style and on-line evaluation metrics. Future work will integrate the presented model into Learning to Rank (LtR) algorithms and will define a collection for the evaluation and comparison of Personalized Information Retrieval (PIR) systems.
The participation of the Information Management System (IMS) Group of the University of Padua in the Total Recall track at TREC 2016 consisted of a set of fully automated experiments based on the two-dimensional probabilistic model. We trained the model in two ways that tried to mimic a real user, and we compared it to two versions of the BM25 model with different parameter settings. This initial set of experiments lays the ground for a wider study that will explore a gamification approach in the context of high-recall situations.
The creation of a labelled dataset for Information Retrieval (IR) purposes is a costly process. For this reason, a mix of crowd-sourcing and active learning approaches has been proposed in the literature in order to assess the relevance of the documents of a collection for a particular query at an affordable cost. In this paper, we present the design of the gamification of this interactive process, drawing inspiration from recent works in the area of gamification for IR. In particular, we focus on three main points: i) we want to create a set of relevance judgements with the least possible effort from human assessors, ii) we use interactive search interfaces that employ game mechanics, and iii) we use Natural Language Processing (NLP) to collect different aspects of a query.
The creation of a labelled dataset for machine learning purposes is a costly process. In recent works, it has been shown that a mix of crowd-sourcing and active learning approaches can be used to annotate objects at an affordable cost. In this paper, we study the gamification of machine learning techniques; in particular, the problem of classification of objects. In this first pilot study, we designed a simple game, based on a visual interpretation of probabilistic classifiers, that consists of separating two sets of coloured points on a two-dimensional plane by means of a straight line. We present the current results of this first experiment, which we used to collect the requirements for the next version of the game and to analyze i) what the 'price' is of building a reasonably accurate classifier with a small amount of labelled objects, and ii) how the accuracy of the players compares to state-of-the-art classification algorithms.
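A minimal sketch of the underlying task (the synthetic point clouds and the scoring are illustrative, not the study's data): the player's straight line is simply a linear decision boundary, so its accuracy can be compared directly with a trained linear classifier such as logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two coloured point clouds on the plane (illustrative data)
X = np.vstack([rng.normal([-1, -1], 0.8, (50, 2)), rng.normal([1, 1], 0.8, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def line_accuracy(a, b, c, X, y):
    """Accuracy of the player's line a*x + b*y + c = 0 used as a decision boundary."""
    pred = (a * X[:, 0] + b * X[:, 1] + c > 0).astype(int)
    return max((pred == y).mean(), ((1 - pred) == y).mean())  # either side may be class 1

player = line_accuracy(1.0, 1.0, 0.0, X, y)          # e.g. the player draws the line x + y = 0
model = LogisticRegression().fit(X, y).score(X, y)   # linear classifier baseline
print(f"player: {player:.2f}  logistic regression: {model:.2f}")
```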
In this paper we present a formal framework, based on the representational theory of measurement, to define and study the properties of utility-oriented measurements of retrieval effectiveness, like AP, RBP, ERR and many other popular IR evaluation measures.
In this paper we present a formal framework to define and study the properties of utility-oriented measurements of retrieval effectiveness, like AP, RBP, ERR and many other popular IR evaluation measures. The proposed framework is grounded in the representational theory of measurement, which provides the foundations of the modern theory of measurement in both the physical and social sciences, thus contributing to explicitly linking IR evaluation to a broader context.
The proposed framework is minimal, in the sense that it relies on just one axiom, from which the other properties are derived. Finally, it contributes to a better understanding and a clearer separation of which issues are due to the inherent problems of comparing systems in terms of retrieval effectiveness and which are due to the expected numerical properties of a measurement.
In this position paper, we discuss the issue of how to ensure reproducibility of results when off-the-shelf open source Information Retrieval (IR) systems are used. These systems have provided a great advancement to the field, but they rely on many configuration parameters which are often implicit or hidden in the documentation and/or source code. If not fully understood and made explicit, these parameters may make it difficult to reproduce results or even to understand why a system is not behaving as expected.
The paper provides examples of the effects of hidden parameters in off-the-shelf IR systems, describes the enabling technologies needed to embody the approach, and shows how these issues can be addressed in the broader context of component-based IR evaluation.
We propose a solution for systematically unfolding the configuration details of off-the-shelf IR systems and understanding whether a particular instance of a system is behaving as expected. The proposal requires to: 1) build a taxonomy of the components used by off-the-shelf systems, 2) uniquely identify them and their combination in a given configuration, 3) run each configuration on standard test collections, 4) compute the expected performance measures for each run, and 5) publish all the gathered information on a Web portal, in order to make it accessible and comparable for everybody how an off-the-shelf system with a given configuration is expected to behave.
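As an illustration of step 2, the sketch below (component names and serialisation choices are hypothetical) derives a unique, reproducible identifier for a configuration by hashing a canonical serialisation of its component choices, so that published runs and performance figures can be matched unambiguously.

```python
import hashlib
import json

def config_fingerprint(components: dict) -> str:
    """Unique identifier for a system configuration: a hash of its canonically
    serialised component choices (names and values are hypothetical examples)."""
    canonical = json.dumps(components, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

config = {
    "stoplist": "smart",
    "stemmer": "porter",
    "model": "BM25",
    "parameters": {"k1": 1.2, "b": 0.75},
}
print(config_fingerprint(config))  # the same component combination always yields the same identifier
```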
We propose a family of new evaluation measures, called Markov Precision (MP), which exploits continuous-time and discrete-time Markov chains, and we conduct a thorough experimental evaluation, also providing an example of calibration of its time parameters.
To address the challenge of adapting experimental evaluation to the constantly evolving user tasks and needs, we develop a new family of Markovian Information Retrieval (IR) evaluation measures, called Markov Precision (MP), where the interaction between the user and the ranked result list is modelled via Markov chains, and which will be able to explicitly link lab-style and on-line evaluation methods.
Moreover, since experimental results are often not easy to understand, we will develop a Web-based Visual Analytics (VA) prototype where an animated state diagram of the Markov chain will explain how the user interacts with the ranked result list, in order to support careful failure analysis.
We present two new measures of retrieval effectiveness, inspired by Graded Average Precision (GAP), which extends Average Precision (AP) to graded relevance judgements. Starting from the random choice of a user, we define Extended Graded Average Precision (xGAP) and Expected Graded Average Precision (eGAP), which are more accurate than GAP in the case of a small number of highly relevant documents with a high probability of being considered relevant by the users. The proposed measures are then evaluated on the TREC 10, TREC 14, and TREC 21 collections, showing that they actually grasp a different angle from GAP and that they are robust to incomplete judgements and shallow pools.
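A rough Monte Carlo reading of the "random choice of a user" idea (illustrative only; this is not the exact closed-form definition of GAP, xGAP, or eGAP): each simulated user picks a relevance threshold at random, the graded judgments are binarised at that threshold, AP is computed, and the measure is estimated as the average over users.

```python
import numpy as np

rng = np.random.default_rng(42)

def ap_binary(rel):
    """AP of a ranked list given binary relevance in rank order
    (assumes the ranking contains all judged documents)."""
    hits, precisions = 0, []
    for rank, r in enumerate(rel, start=1):
        if r:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def expected_graded_ap(graded, threshold_probs, samples=10000):
    """Monte Carlo expectation of AP over users who binarise graded judgments at a
    randomly chosen threshold; threshold_probs[g] = P(user requires grade >= g)."""
    grades = np.array(sorted(threshold_probs))
    probs = np.array([threshold_probs[g] for g in grades])
    thresholds = rng.choice(grades, size=samples, p=probs)
    graded = np.asarray(graded)
    return float(np.mean([ap_binary(graded >= t) for t in thresholds]))

# Graded judgments 0..2 in rank order; 70% of users accept grade >= 1, 30% demand grade >= 2
print(expected_graded_ap([2, 1, 0, 2, 0, 1], {1: 0.7, 2: 0.3}))
```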
We propose a family of new evaluation measures, called Markov Precision (MP), which exploits continuous-time and discrete-time Markov chains in order to inject user models into precision. Continuous-time MP behaves like time-calibrated measures, bringing the time spent by the user into the evaluation of a system; discrete-time MP behaves like traditional evaluation measures. Being part of the same Markovian framework, the time-based and rank-based versions of MP produce values that are directly comparable.
Finally, we conduct a thorough experimental evaluation of MP on standard TREC collections in order to show that MP is as reliable as other measures and we provide an example of calibration of its time parameters based on click logs from Yandex.
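An illustrative discrete-time sketch of the idea (the transition structure and the reduction to an AP-like value are simplifying assumptions, not the paper's exact formulation): precision at each relevant rank is averaged with weights given by the stationary distribution of a Markov chain over the relevant positions, instead of uniformly as in AP.

```python
import numpy as np

def markov_precision(rel, P=None):
    """Illustrative discrete-time MP: precision at each relevant rank, weighted by the
    stationary distribution of a row-stochastic Markov chain whose states are the
    relevant positions. With uniform weights it reduces to an AP-like average."""
    ranks = [i + 1 for i, r in enumerate(rel) if r > 0]
    if not ranks:
        return 0.0
    prec = [sum(1 for j in ranks if j <= r) / r for r in ranks]
    n = len(ranks)
    if P is None:  # default: uniform random walk over the relevant positions
        P = np.full((n, n), 1.0 / n)
    # stationary distribution: left eigenvector of P associated with eigenvalue 1
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    pi = pi / pi.sum()
    return float(np.dot(pi, prec))

print(markov_precision([1, 0, 1, 0, 0, 1]))  # with the uniform chain this is an AP-like value
```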