An Ensemble Approach to Data Mining for Real-time Information Retrieval

This was originally published in September 2020

Abstract

Information retrieval (IR), locating relevant documents in a document collection based on a user's query, is a common problem in text analysis. Traditional keyword-based IR engines are good at finding relevant information, but struggle to provide semantic and contextual results for complex queries. We propose an ensemble approach, a combination of multiple Natural Language Processing (NLP) models, which transform documents into vectors, followed by scoring and ranking documents based on their relevance to the user search query. To provide an effective search mechanism over a large document corpus, we incorporate Elasticsearch in our solution. It is a popular distributed search engine which allows indexing and searching of documents in near real-time. With the retrieved documents, we generate multi-sentence summaries using an extractive text summarizer to make it easier for users to glean the relevant content. All these components are packaged into an end-to-end solution encapsulated by a Python Flask UI where the users can enter a search query and get the relevant results in near real-time. A major challenge faced in evaluating the performance of NLP-based IR models is the absence of relevance labels or scores for document-query pairs. To benchmark the performance of our ensemble approach to this problem, we use titles of the documents as queries to calculate the Mean Reciprocal Rank (MRR) as a validation metric for individual techniques and their ensemble.

Abstract

Thanks for reading. Want to continue?