Publications
2026
- [WSDM 2026] TRUE: A Reproducible Framework for LLM-Driven Relevance Judgment in Information Retrieval. Mouly Dewan, Jiqun Liu, and Chirag Shah. Accepted at WSDM 2026.
LLM-based relevance judgment generation has become a crucial approach for advancing evaluation methodologies in Information Retrieval (IR). It has progressed significantly, often showing high correlation with human judgments, as reflected in LLMJudge leaderboards (Rahmani et al., 2025). However, existing methods for relevance judgment rely heavily on sensitive prompting strategies and lack standardized workflows for generating reliable labels. To fill this gap, we reintroduce our method, Task-aware Rubric-based Evaluation (TRUE), for relevance judgment generation. Originally developed for usefulness evaluation in search sessions, TRUE is extended here to relevance judgment because of its demonstrated effectiveness and reproducible workflow. The framework leverages iterative data sampling and reasoning to evaluate relevance across multiple factors, including intent, coverage, specificity, accuracy, and usefulness. In this paper, we evaluate TRUE on the TREC DL 2019 and 2020 datasets and the LLMJudge dataset, and our results show that TRUE achieves strong performance on the system-ranking LLM leaderboards. The primary focus of this work is to provide a reproducible framework for LLM-based relevance judgments, and we further analyze the effectiveness of TRUE across multiple dimensions.
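The abstract describes the TRUE workflow only at a high level. As a rough illustration, here is a minimal Python sketch of a rubric-driven judge with repeated sampling; the 0-2 factor scale, the equal weighting of factors, and the judge_factor stub are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of rubric-based relevance labeling with iterative
# sampling, loosely in the spirit of TRUE. Scale and weights are assumed.

FACTORS = ["intent", "coverage", "specificity", "accuracy", "usefulness"]

def judge_factor(query: str, document: str, factor: str) -> int:
    """Placeholder for an LLM call that scores one rubric factor on 0-2.
    In practice this would prompt the model with that factor's rubric."""
    raise NotImplementedError("wire up an LLM client here")

def true_style_label(query: str, document: str, n_samples: int = 3) -> int:
    """Judge each factor n_samples times, average to smooth single-call
    variance, then collapse the factor means into one graded label."""
    per_factor = {}
    for factor in FACTORS:
        scores = [judge_factor(query, document, factor) for _ in range(n_samples)]
        per_factor[factor] = sum(scores) / len(scores)
    return round(sum(per_factor.values()) / len(per_factor))
```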
2025
- [arXiv] LLM-Driven Usefulness Judgment for Web Search Evaluation. Mouly Dewan, Jiqun Liu, Aditya Gautam, and 1 more author. arXiv preprint arXiv:2504.14401, 2025. In submission to WSDM 2026.
Evaluation is fundamental in optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods primarily rely on relevance labels, which assess how well retrieved documents match a user’s query. However, relevance alone fails to capture a search system’s effectiveness in helping users achieve their search goals, making usefulness a critical evaluation criterion. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness. We propose Task-aware Rubric-based Usefulness Evaluation (TRUE), a rubric-driven evaluation method that employs iterative sampling and reasoning to model complex search behavior patterns. Our findings show that (i) LLMs can generate moderate usefulness labels by leveraging comprehensive search session history incorporating personalization and contextual understanding, and (ii) fine-tuned LLMs improve usefulness judgments when provided with structured search session contexts. Additionally, we examine whether LLMs can distinguish between relevance and usefulness, particularly in cases where this divergence impacts search success. We also conduct an ablation study to identify key metrics for accurate usefulness label generation, optimizing for token efficiency and cost-effectiveness in real-world applications. This study advances LLM-based usefulness evaluation by refining key user metrics, exploring LLM-generated label reliability, and ensuring feasibility for large-scale search systems.
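As context for the behavior signals mentioned above, the sketch below shows one plausible way to serialize a search session into the structured context an LLM judge would receive. The event fields (clicks, dwell time) are generic examples of implicit and explicit signals; the paper's actual signal set and template may differ.

```python
from dataclasses import dataclass

@dataclass
class SessionEvent:
    """One step of a search session: the issued query plus simple
    implicit-feedback signals (illustrative field set)."""
    query: str
    clicked: bool
    dwell_seconds: float

def build_session_context(task: str, events: list[SessionEvent]) -> str:
    """Serialize the session history into a plain-text block that can
    be prepended to a usefulness-judgment prompt."""
    lines = [f"Search task: {task}"]
    for i, e in enumerate(events, 1):
        action = "clicked a result" if e.clicked else "no click"
        lines.append(f"{i}. query={e.query!r} ({action}, dwell {e.dwell_seconds:.0f}s)")
    return "\n".join(lines)
```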
- [SIGIR 2025] LLM-Driven Usefulness Labeling for IR Evaluation. Mouly Dewan, Jiqun Liu, and Chirag Shah. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025.
In the information retrieval (IR) domain, evaluation plays a crucial role in optimizing search experiences and supporting diverse user intents. In the recent LLM era, research has been conducted to automate document relevance labeling, a task traditionally performed by crowd-sourced workers in a process that is both time-consuming and costly. This study focuses on LLM-generated usefulness labels, a crucial evaluation metric that considers the user’s search intents and task objectives, an aspect where relevance falls short. Our experiments utilize task-level, query-level, and document-level features along with user search behavior signals, which are essential in defining the usefulness of a document. Our research finds that (i) pre-trained LLMs can generate moderate usefulness labels by understanding the comprehensive search task session, and (ii) pre-trained LLMs produce better judgments in short search sessions when provided with search session contexts. Furthermore, we investigate whether LLMs can capture the divergence between relevance and usefulness, and we conduct an ablation study to identify the most critical metrics for accurate usefulness label generation. In conclusion, this work explores LLM-generated usefulness labels by evaluating critical metrics and optimizing for practicality in real-world settings.
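The abstract reports "moderate" usefulness labels without naming an agreement statistic. A common choice for comparing LLM-generated labels against human gold labels is Cohen's kappa, sketched below as a self-contained reference implementation (the paper may use a different metric).

```python
from collections import Counter

def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Chance-corrected agreement between two label sequences, e.g.
    LLM-generated vs. human usefulness labels on the same documents."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Example: prints roughly 0.71, often read as "substantial" agreement.
print(cohen_kappa([2, 0, 1, 2, 0], [2, 0, 1, 1, 0]))
```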
2024
- [EMNLP 2024] ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs. Preetam Prabhu Srikar Dammu, Himanshu Naidu, Mouly Dewan, and 4 more authors. In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024.
In the midst of widespread misinformation and disinformation through social media and the proliferation of AI-generated texts, it has become increasingly difficult for people to validate and trust information they encounter. Many fact-checking approaches and tools have been developed, but they often lack appropriate explainability or granularity to be useful in various contexts. A text validation method that is easy to use, accessible, and can perform fine-grained evidence attribution has become crucial. More importantly, building user trust in such a method requires presenting the rationale behind each prediction, as research shows this significantly influences people’s belief in automated systems. Localizing and bringing users’ attention to the specific problematic content is also paramount, instead of providing simple blanket labels. In this paper, we present ClaimVer, a human-centric framework tailored to meet users’ informational and verification needs by generating rich annotations and thereby reducing cognitive load. Designed to deliver comprehensive evaluations of texts, it highlights each claim, verifies it against a trusted knowledge graph (KG), presents the evidence, and provides succinct, clear explanations for each claim prediction. Finally, our framework introduces an attribution score, enhancing applicability across a wide range of downstream tasks.
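As a toy illustration of the claim-level pipeline described above, the sketch below checks each claim against knowledge-graph triples and derives a simple attribution score. The substring-based matching and the fraction-of-supported-claims score are deliberate simplifications; ClaimVer's actual KG retrieval and attribution scoring are more sophisticated.

```python
from dataclasses import dataclass, field

Triple = tuple[str, str, str]  # (subject, relation, object)

@dataclass
class Verdict:
    claim: str
    label: str                            # "supported" / "not enough evidence"
    evidence: list[Triple] = field(default_factory=list)

def verify_claims(claims: list[str], kg: list[Triple]) -> tuple[list[Verdict], float]:
    """Attach matching KG triples to each claim as evidence (naive
    substring matching) and report the share of supported claims."""
    verdicts = []
    for claim in claims:
        text = claim.lower()
        evidence = [t for t in kg if t[0].lower() in text and t[2].lower() in text]
        label = "supported" if evidence else "not enough evidence"
        verdicts.append(Verdict(claim, label, evidence))
    attribution = sum(v.label == "supported" for v in verdicts) / max(len(verdicts), 1)
    return verdicts, attribution
```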
- [ASIS&T 2024] Mind Over Misinformation: Investigating the Factors of Cognitive Influences in Information Acceptance. Mouly Dewan and Chirag Shah. Proceedings of the Association for Information Science and Technology, Nov 2024.
Many scholars have investigated the identification, characterization, dissemination, and prevention of misinformation in recent years. A fundamental question underlying these investigations is: why do people believe a piece of information, whether true or false? The primary objective of this study is to understand the psychological drivers that make individuals believe information before assessing its veracity. Specifically, the study examines cognitive biases that influence an individual’s decision making about information in digital settings. In a quantitative survey with 41 participants, we attempt to induce cognitive bias in participants and measure its effect on their decision making. We find that a major portion of our participants were successfully biased, which in turn had a significant effect on their decision making while engaging with information. Furthermore, we assess whether an individual’s political affiliation affects perceived truthfulness when engaging with political information. This study shows how easily cognitive bias can be induced and how it affects an individual’s belief structure on digital platforms.
2021
- [Springer 2021] Tolerating soft errors with horizontal-vertical-diagonal-n-queen (HVDNQ) parity. Muhammad Sheikh Sadi, Sumaiya Sumaiya, Mouly Dewan, and 1 more author. Journal of Electronic Testing, Nov 2021.
A new error detection and correction methodology, called Horizontal-Vertical-Diagonal-N-Queen Parity (HVDNQ), is proposed in this paper. The approach relies on five types of parity: horizontal, vertical, forward diagonal, backward diagonal, and queen parity. It operates on an N × N cell area and can correct multi-bit upsets. Experimental analysis validates the effectiveness of the proposed methodology by comparing its efficiency with existing methodologies. For various error patterns, such as equilateral triangles, pentagons, and hexagons, HVDNQ detects and corrects errors considerably better than existing methods.
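To make the parity dimensions concrete, here is a small Python sketch computing the horizontal, vertical, and both diagonal parity vectors for an N × N bit matrix. The fifth, queen-parity dimension is omitted because its exact cell grouping is not reproduced in the abstract.

```python
def hvd_parities(bits: list[list[int]]):
    """Even parity along rows, columns, and both diagonal directions of
    an N x N bit matrix (four of HVDNQ's five parity dimensions)."""
    n = len(bits)
    horizontal = [sum(row) % 2 for row in bits]
    vertical = [sum(bits[r][c] for r in range(n)) % 2 for c in range(n)]
    forward = [0] * (2 * n - 1)   # one parity per diagonal of constant r + c
    backward = [0] * (2 * n - 1)  # one parity per diagonal of constant r - c
    for r in range(n):
        for c in range(n):
            forward[r + c] ^= bits[r][c]
            backward[r - c + n - 1] ^= bits[r][c]
    return horizontal, vertical, forward, backward
```

Recomputing these vectors after readout and comparing them with the stored copies flags every parity line that crosses a flipped bit; intersecting the flagged lines is what localizes, and hence corrects, the upset.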
2019
- [IEEE 2019] Soft error tolerance using horizontal, vertical, diagonal and seven queen parity. Mouly Dewan, Muhammad Sheikh Sadi, and others. In 2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON), Nov 2019.
A soft error is a transient fault that corrupts a signal or datum. It is one of the biggest reliability challenges in electronic devices and in several safety-critical applications. In embedded systems, soft errors often cause run-time failures, and highly complex embedded systems are especially vulnerable. In a dense memory bit-cell area, a single upset that once corrupted a solitary bit-cell can now also corrupt contiguous cells, and many bit upsets occur when data is transferred from one end to the other. To address this issue, several techniques have been introduced to detect multiple errors and increase the correction rate. To protect against soft errors, this paper proposes a high-level error detection and correction method called HVD7Q (Horizontal-Vertical-Diagonal-7-Queen Parity). The approach relies on parities in five dimensions: one for each row, column, forward diagonal, backward diagonal, and queen parity line. It works on a 7 × 7 cell area and can correct up to 5 bit upsets along the horizontal, vertical, forward diagonal, and backward diagonal. The analysis validates the usefulness of the proposed method by comparing its efficiency with existing methodologies.
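The correction step described above can be illustrated with the basic detect-and-locate move shared by HVD-style schemes: a single flipped bit produces exactly one row-parity and one column-parity mismatch, and their intersection pinpoints the cell. The sketch below shows only that step; HVD7Q's diagonal and queen parities are what extend correction to the multi-bit patterns the abstract mentions.

```python
def locate_single_flip(stored_h: list[int], fresh_h: list[int],
                       stored_v: list[int], fresh_v: list[int]):
    """Intersect the mismatching row and column parity lines to localize
    one flipped bit; returns None for multi-bit patterns, which need the
    diagonal and queen parity dimensions to disambiguate."""
    rows = [i for i, (s, f) in enumerate(zip(stored_h, fresh_h)) if s != f]
    cols = [j for j, (s, f) in enumerate(zip(stored_v, fresh_v)) if s != f]
    if len(rows) == 1 and len(cols) == 1:
        return rows[0], cols[0]   # (row, col) of the corrupted cell; flip it back
    return None
```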