Automated surveillance of arXiv for my core research tracks.
2026-02-27 | Fan Shu, Yite Wang, Ruofan Wu, Boyi L...
The fast-growing demand for using Large Language Models (LLMs) to tackle complex multi-step data science tasks creates an urgent need for accurate benchmarking. Existing benchmarks have two major gaps: (i) the lack of standardized, process-aware evaluation that captures...
2026-02-27 | Jenny Y. Huang, Leshem Choshen, Ramon...
Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using...
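As a generic illustration of the design choice this entry describes (not the paper's actual setup), the question is whether assistant turns are kept when rebuilding the conversation history for the next request; `build_history` is a hypothetical helper:

```python
def build_history(turns, keep_assistant=True):
    """Rebuild a chat history from (role, text) pairs, optionally
    dropping the assistant's own past responses (illustration only)."""
    history = []
    for role, text in turns:
        if role == "assistant" and not keep_assistant:
            continue  # condition only on user turns
        history.append({"role": role, "content": text})
    return history

turns = [("user", "Hi"), ("assistant", "Hello!"), ("user", "Help me")]
with_self = build_history(turns)                           # 3 messages
without_self = build_history(turns, keep_assistant=False)  # 2 messages
```

The paper's experiments presumably compare model behavior under these two conditioning regimes.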
2026-02-27 | Weinan Dai, Hanlin Wu, Qiying Yu, Hua...
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel...
2026-02-27 | Saber Zerhoudi, Michael Granitzer
Simulating nuanced user experiences within complex interactive search systems poses distinct challenges for traditional methodologies, which often rely on static user proxies or, more recently, on standalone large language model (LLM) agents that may lack deep, verifiable grounding. The true...
2026-02-27 | Jialiang Fan, Weizhe Xu, Mengyu Liu, ...
Safety-critical task planning in robotic systems remains challenging: classical planners suffer from poor scalability, Reinforcement Learning (RL)-based methods generalize poorly, and base Large Language Models (LLMs) cannot guarantee safety. To address this gap, we propose safety-generalizable large language models, named...
2026-02-27 | Abisheka Pitumpe, Amir Rahmati
Job-based smishing scams, where victims are recruited under the guise of remote job opportunities, represent a rapidly growing and understudied threat within the broader landscape of online fraud. In this paper, we present Anansi, the first scalable, end-to-end measurement pipeline...
2026-02-27 | Saleh Afroogh, Seyd Ishtiaque Ahmed, ...
This study provides a cross-disciplinary examination of Explainable Artificial Intelligence (XAI) approaches, focusing on deep neural networks (DNNs) and large language models (LLMs), and identifies empirical and conceptual limitations in current XAI. We discuss critical symptoms that stem from deeper root causes...
2026-02-27 | Antoine Peyronnet, Fabian Gloeckle, A...
We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating...
2026-02-27 | Adam Dejl, Deniz Gorur, Francesca Toni
Argumentative LLMs (ArgLLMs) are an existing approach leveraging Large Language Models (LLMs) and computational argumentation for decision-making, with the aim of making the resulting decisions faithfully explainable to and contestable by humans. Here we propose a web-based system implementing ArgLLM-empowered...
2026-02-27 | James L. Zainaldin, Cameron Pattison,...
This study presents the first systematic, reference-free human evaluation of large language model (LLM) machine translation (MT) for Ancient Greek (AG) technical prose. We evaluate translations by three commercial LLMs (Claude, Gemini, ChatGPT) of twenty paragraph-length passages from two works...
2026-02-27 | Sara Nabhani, Federico Pianzola, Khal...
Can narratives make arguments more persuasive? And to this end, which narrative features matter most? Although stories are often seen as powerful tools for persuasion, their specific role in online, unstructured argumentation remains underexplored. To address this gap, we present...
2026-02-27 | Nathanael Jo, Nikhil Garg, Manish Rag...
Machine learning models -- including large language models (LLMs) -- are often said to exhibit monoculture, where outputs agree strikingly often. But what does it actually mean for models to agree too much? We argue that this question is inherently...
2026-02-27 | Jaekyung Cho
Resource-efficient training optimization techniques are becoming increasingly important as the size of large language models (LLMs) continues to grow. In particular, batch packing is commonly used in pre-training and supervised fine-tuning to achieve resource-efficient training. We propose preference packing, a...
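Batch packing itself is a standard trick: concatenate variable-length examples until a fixed token budget is filled, minimizing padding. A minimal greedy sketch of that baseline (illustrative only; not the paper's preference-packing algorithm):

```python
def greedy_pack(lengths, budget):
    """Greedily pack example lengths into bins of at most `budget` tokens.
    Returns a list of bins, each a list of example indices."""
    bins, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if n > budget:
            raise ValueError(f"example {i} exceeds token budget")
        if used + n > budget:
            bins.append(current)       # close the full bin
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        bins.append(current)
    return bins

# e.g. lengths 300/500/200/700 with a 1024-token budget fill 2 bins
```

Preference data adds a wrinkle the sketch ignores: chosen/rejected responses share a prompt, which is presumably what the proposed method exploits.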
2026-02-27 | Joon Kiat Chua, Donghao Huang, Zhaoxi...
Large language model (LLM) agents, such as OpenAI's Operator and Claude's Computer Use, can automate workflows but are unable to handle payment tasks. Existing agentic solutions have gained significant attention; however, even the latest approaches face challenges in implementing end-to-end agentic...
2026-02-27 | Donghao Huang, Zhaoxia Wang
Large language models (LLMs) with reasoning capabilities have fueled a compelling narrative that reasoning universally improves performance across language tasks. We test this claim through a comprehensive evaluation of 504 configurations across seven model families--including adaptive, conditional, and reinforcement learning-based...
2026-02-27 | Ferran Agullo, Joan Oliveras, Chen Wa...
Large Language Model (LLM) adapters enable low-cost model specialization, but introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently. While prior work has largely focused on latency minimization, resource efficiency through...
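Independently of this paper's scheduler, the core caching problem can be sketched as an LRU cache over adapter weights: evict the least-recently-used adapter when the adapter memory budget is exhausted. A toy sketch, where `loader` is a hypothetical function mapping an adapter id to its weights:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache for LoRA-style adapters keyed by adapter id."""
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical id -> weights loader
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)  # mark as recently used
            return self.cache[adapter_id]
        self.misses += 1
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        weights = self.loader(adapter_id)
        self.cache[adapter_id] = weights
        return weights
```

Latency-oriented systems tune exactly this eviction policy; the paper's resource-efficiency angle presumably trades it off against hosting cost.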
2026-02-27 | Daniel Yang, Samuel Stante, Florian R...
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this...
2026-02-27 | Abhishek Kulkarni, Sharon Lynn Chu
Interest-based learning (IBL) is a paradigm of instruction in which educational content is contextualized using learners' interests to enhance content relevance. IBL has been shown to result in improved learning outcomes. Unfortunately, high effort is needed for instructors to design...
2026-02-27 | Zhaolin Cai, Fan Li, Huiyu Duan, Liju...
Video anomaly detection (VAD) aims to identify abnormal events in videos. Traditional VAD methods generally suffer from the high costs of labeled data and full training, thus some recent works have explored leveraging frozen multi-modal large language models (MLLMs) in...
2026-02-27 | Zhicheng Fang, Jingjie Zheng, Chenxu ...
Jailbreak techniques for large language models (LLMs) evolve faster than benchmarks, making robustness estimates stale and difficult to compare across papers due to drift in datasets, harnesses, and judging protocols. We introduce JAILBREAK FOUNDRY (JBF), a system that addresses this...
2026-01-07 | Ruiyi Guo, Bodong Zhang
Artificial intelligence governance exhibits a striking paradox: while major jurisdictions converge rhetorically around concepts such as safety, risk, and accountability, their regulatory frameworks remain fundamentally divergent and mutually unintelligible. This paper argues that this fragmentation cannot be explained solely by...
2026-01-07 | Tom Deckenbrunnen, Alessio Buscemi, M...
The EU AI Act adopts a horizontal and adaptive approach to govern AI technologies characterised by rapid development and unpredictable emerging capabilities. To maintain relevance, the Act embeds provisions for regulatory learning. However, these provisions operate within a complex network...
2026-01-06 | Simon Chesterman
This essay examines the evolving concept of legal personality through the lens of recent developments in artificial intelligence and the possible emergence of superintelligence. Legal systems have long been open to extending personhood to non-human entities, most prominently corporations, for...
2026-01-03 | Wenbo Wu, George Konstantinidis
Trust and Reputation Management Systems (TRMSs) are critical for the modern web, yet their reliance on subjective user ratings or narrow Quality of Service (QoS) metrics lacks objective grounding. Concurrently, while regulatory frameworks like GDPR and HIPAA provide objective behavioral...
2025-12-29 | Jake Hartnell, Eugenio Battaglia
Current DAO governance praxis limits organizational expressivity and reduces complex organizational decisions to token-weighted voting due to on-chain computational limits. This paper proposes verifiable off-chain computation (leveraging Verifiable Services, TEEs, and ZK proofs) as a framework to transcend these constraints...
2025-12-26 | Sunil Arora, John Hastings
Natural Language Processing (NLP) systems are increasingly used in sensitive domains such as healthcare, finance, and government, where they handle large volumes of personal and regulated data. However, these systems introduce distinct risks related to security, privacy, and regulatory compliance...
2025-12-22 | Shaun Khoo, Jessica Foo, Roy Ka-Wei Lee
Agentic AI systems present both significant opportunities and novel risks due to their capacity for autonomous action, encompassing tasks such as code execution, internet interaction, and file modification. This poses considerable challenges for effective organizational governance, particularly in comprehensively identifying,...
2025-12-21 | Yang Ni, Tong Yang
Large Language Models (LLMs) and AI chatbots are increasingly used for emotional and mental health support due to their low cost, immediacy, and accessibility. However, when safety guardrails are triggered, conversations may be abruptly terminated, introducing a distinct form of...
2025-12-20 | Wei Meng
This paper tackles practical challenges in governing child-centered artificial intelligence: policy texts state principles and requirements but often lack reproducible evidence anchors, explicit causal pathways, executable governance toolchains, and computable audit metrics. We propose Graph-GAP, a methodology that decomposes...
2025-12-19 | Lucia Velasco, Charles Martinet, Henr...
This policy memo examines the evolution of the international AI Summit series, initiated at Bletchley Park in 2023 and continued through Seoul in 2024 and Paris in 2025, as a forum for cooperation on the governance of advanced artificial intelligence....
2025-12-19 | Robin Schimmelpfennig, Mark Díaz, Vin...
Over a billion users across the globe interact with AI systems engineered with increasing sophistication to mimic human traits. This shift has triggered urgent debate regarding Anthropomorphism, the attribution of human characteristics to synthetic agents, and its potential to induce...
2025-12-18 | Otman A. Basir
Artificial intelligence systems are increasingly deployed in domains that shape human behaviour, institutional decision-making, and societal outcomes. Existing responsible AI and governance efforts provide important normative principles but often lack enforceable engineering mechanisms that operate throughout the system lifecycle. This...
2025-12-18 | A. Talha Yalta, A. Yasemin Yalta
Growing concerns about fairness, privacy, robustness, and transparency have made it a central expectation of AI governance that automated decisions be explainable by institutions and intelligible to affected parties. We introduce the Smart Data Portfolio (SDP) framework, which treats data...
2025-12-17 | Atte Ojanen, Johannes Anttila, Thilo ...
The rapid advancements in artificial intelligence (AI) present unique challenges for policymakers who seek to govern the technology. In this context, the Delphi method has become an established way to identify consensus and disagreement on emerging technological issues among experts...
2025-12-16 | Francesca Gomez, Adam Buick, Leah Fer...
Frontier AI developers operate at the intersection of rapid technical progress, extreme risk exposure, and growing regulatory scrutiny. While a range of external evaluations and safety frameworks have emerged, comparatively little attention has been paid to how internal organizational assurance...