Research Dashboard

Automated surveillance of arXiv for my core research tracks.

1. Kinetic AI Risk

Scope: Intersection of Large Language Models (LLMs) and ICS/SCADA.

Counterintuitive problems in discrete probability

2026-06-05 | Luca Avena, Gianmarco Bet, Bernardo B...

This manuscript contains a collection of counterintuitive problems in discrete probability, together with detailed solutions. The dataset was constructed as part of a broader research project investigating the capabilities of the latest-generation Large Language Models (LLMs) in solving discrete probability...

How reliable are LLMs when it comes to playing dice?

2026-06-05 | Luca Avena, Gianmarco Bet, Bernardo B...

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets, respectively a set of standard exercises and a set of counterintuitive exercises, designed to trigger heuristic reasoning,...

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

2026-06-05 | Xintao Wang, Sirui Zheng, Hongqiu Wu,...

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society...

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

2026-06-05 | Songhao Wu, Zhongxin Chen, Yuxuan Liu...

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause...

Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning

2026-06-05 | Fatema Siddika, Md Anwar Hossen, Tanw...

Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared...

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

2026-06-05 | Jiayu Wang, Weijiang Lv, Bowen Fu, Ji...

As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant...

Sycophantic Praise: Evaluating Excessive Praise in Language Models

2026-06-05 | Daniel Vennemeyer, Phan Anh Duong, Me...

Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods....

Evidence Markets

2026-06-05 | Safwan Hossain, Gabriel Andrade, Chen...

Modern prediction markets face two limitations that restrict their applicability in a range of settings: (i) they reveal what the crowd believes but not the evidence or reasoning behind those beliefs, and (ii) they require an event with an external ground truth that resolves...

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

2026-06-05 | Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao...

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference...

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

2026-06-05 | Yang Zhang, Xiao Fei, Amr Mohamed, Sa...

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel...

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

2026-06-05 | Chuan Xiao, Zhengbo Jiao, Shaobo Wang...

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the...

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

2026-06-05 | Yuxiang Chen, Jun Wang

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across...

Online Pandora's Box for Contextual LLM Cascading

2026-06-05 | Alexandre Belloni, Yan Chen, Yehua Wei

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query...

Self-evolving LLM agents with in-distribution Optimization

2026-06-05 | Yudi Zhang, Meng Fang, Zhenfang Chen,...

Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often receive delayed...

LLM-Guided Evolution for Medical Decision Pipelines

2026-06-05 | Ivan Sviridov, Artem Oskin, Ivan Pani...

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We...

Empirical Evaluation of Large Language Models for Migration of Code Fragments to Post-Quantum Cryptography

2026-06-05 | Javier Pallarés de Bonrostro, Ana I. ...

The transition to post-quantum cryptography (PQC) requires not only replacing vulnerable cryptographic primitives, but also refactoring the surrounding software logic. While existing PQC migration frameworks provide organizational guidance, practical code-level remediation remains largely manual and error-prone. This paper evaluates whether...

2. GRC Engineering & AI Governance

Scope: AI Governance, Policy as Code, and Compliance Engineering.

From Abstract Threats to Institutional Realities: A Comparative Semantic Network Analysis of AI Securitisation in the US, EU, and China

2026-01-07 | Ruiyi Guo, Bodong Zhang

Artificial intelligence governance exhibits a striking paradox: while major jurisdictions converge rhetorically around concepts such as safety, risk, and accountability, their regulatory frameworks remain fundamentally divergent and mutually unintelligible. This paper argues that this fragmentation cannot be explained solely by...

From Slaves to Synths? Superintelligence and the Evolution of Legal Personality

2026-01-06 | Simon Chesterman

This essay examines the evolving concept of legal personality through the lens of recent developments in artificial intelligence and the possible emergence of superintelligence. Legal systems have long been open to extending personhood to non-human entities, most prominently corporations, for...

Compliance as a Trust Metric

2026-01-03 | Wenbo Wu, George Konstantinidis

Trust and Reputation Management Systems (TRMSs) are critical for the modern web, yet their reliance on subjective user ratings or narrow Quality of Service (QoS) metrics lacks objective grounding. Concurrently, while regulatory frameworks like GDPR and HIPAA provide objective behavioral...

Verifiable Off-Chain Governance

2025-12-29 | Jake Hartnell, Eugenio Battaglia

Current DAO governance praxis limits organizational expressivity and reduces complex organizational decisions to token-weighted voting due to on-chain computational limits. This paper proposes verifiable off-chain computation (leveraging Verifiable Services, TEEs, and ZK proofs) as a framework to transcend these constraints...

With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems

2025-12-22 | Shaun Khoo, Jessica Foo, Roy Ka-Wei Lee

Agentic AI systems present both significant opportunities and novel risks due to their capacity for autonomous action, encompassing tasks such as code execution, internet interaction, and file modification. This poses considerable challenges for effective organizational governance, particularly in comprehensively identifying,...

Computable Gap Assessment of Artificial Intelligence Governance in Children's Centres: Evidence-Mechanism-Governance-Indicator Modelling of UNICEF's Guidance on AI and Children 3.0 Based on the Graph-GAP Framework

2025-12-20 | Wei Meng

This paper tackles practical challenges in governing child-centered artificial intelligence: policy texts state principles and requirements but often lack reproducible evidence anchors, explicit causal pathways, executable governance toolchains, and computable audit metrics. We propose Graph-GAP, a methodology that decomposes...

The Future of the AI Summit Series

2025-12-19 | Lucia Velasco, Charles Martinet, Henr...

This policy memo examines the evolution of the international AI Summit series, initiated at Bletchley Park in 2023 and continued through Seoul in 2024 and Paris in 2025, as a forum for cooperation on the governance of advanced artificial intelligence....

Smart Data Portfolios: A Quantitative Framework for Input Governance in AI

2025-12-18 | A. Talha Yalta, A. Yasemin Yalta

Growing concerns about fairness, privacy, robustness, and transparency have made it a central expectation of AI governance that automated decisions be explainable by institutions and intelligible to affected parties. We introduce the Smart Data Portfolio (SDP) framework, which treats data...

How frontier AI companies could implement an internal audit function

2025-12-16 | Francesca Gomez, Adam Buick, Leah Fer...

Frontier AI developers operate at the intersection of rapid technical progress, extreme risk exposure, and growing regulatory scrutiny. While a range of external evaluations and safety frameworks have emerged, comparatively little attention has been paid to how internal organizational assurance...