Reinforcement Learning and Collusion (submitted)
Online Appendix
This paper develops an analytical framework to characterize the long-run policies learned by repeatedly interacting algorithms. In the model, algorithms observe a state variable and update their policies to maximize long-term discounted payoffs. I show that their long-run policies correspond to the stable equilibria of a tractable differential equation. I use this framework to analyze a repeated Bertrand game, where the stage-game Nash equilibrium serves as a non-collusive benchmark. I derive necessary and sufficient conditions under which this Nash equilibrium is learned, revealing how the interplay between the monitoring technology (state variables) and market conditions (price elasticities, markups) determines whether competitive or collusive outcomes emerge. Finally, I apply these insights to evaluate two key regulatory policies: limiting the data inputs available to algorithms and imposing competition in the software provider market. My results demonstrate that the former is the more promising approach to curbing algorithmic collusion.
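The characterisation step in this abstract, identifying long-run outcomes with stable equilibria of a differential equation, can be illustrated with a minimal numerical sketch. The dynamic `f` below is a toy example of my own choosing, not the paper's model: we verify a candidate rest point of dx/dt = f(x) and check local stability via the eigenvalues of the Jacobian.

```python
import numpy as np

def f(x):
    # Toy two-dimensional dynamic with a rest point at the origin
    # (an illustrative assumption, not the paper's equation).
    return np.array([-x[0] + 0.5 * x[1], -x[1]])

def jacobian(f, x, eps=1e-6):
    """Numerical Jacobian of f at x by central differences."""
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x_star = np.zeros(2)                       # candidate rest point: f(x_star) = 0
J = jacobian(f, x_star)
# Locally stable if every eigenvalue of the Jacobian has negative real part.
stable = np.all(np.linalg.eigvals(J).real < 0)
```

In the framework described above, only rest points passing such a stability check are candidates for long-run learned policies.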
Strategic Communication and Algorithmic Advice
(with Emilio Calvano and Juha Tolvanen)
We study a model of communication in which a better-informed sender learns to communicate with a receiver who takes an action that affects the welfare of both. Specifically, we model the sender as a machine-learning-based algorithmic recommendation system and the receiver as a rational, best-responding agent who understands how the algorithm works. We find robust communication, which either emerges from scratch (i.e., from babbling, where no common language initially exists) or persists when initialized. We show that the sender's learning hinders communication, limiting the extent of information transmission even when the preferences of the algorithm's designer and the receiver are aligned. When the two are not aligned, a robust pattern arises: the algorithm plays a cut-off strategy, pooling messages when its private information suggests actions in the direction of its preference bias while sending mostly separating signals otherwise.
The Bounds to Algorithmic Collusion: Q-learning, gradient learning, and the Folk Theorem
(with Galit Ashkenazi-Golan, Domenico Mergoni Cecchelli, and Edward Plumb)
We explore the behaviour that emerges when learning agents repeatedly interact strategically, for a wide range of learning dynamics including Q-learning, projected gradient, replicator, and log-barrier dynamics. Going beyond the better-understood classes of potential games and zero-sum games, we consider a general repeated game with finite recall, under different forms of monitoring. We obtain a Folk Theorem-like result and characterise the set of payoff vectors attainable under these dynamics, uncovering a wide range of possibilities for the emergence of algorithmic collusion.
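One of the dynamics named above, the replicator dynamic, admits a very short self-contained sketch. The payoff matrix below is an illustrative prisoner's-dilemma stage game of my own choosing, not an example from the paper: both populations converge to the dominant (defect) action of the one-shot game.

```python
import numpy as np

# Illustrative 2x2 prisoner's-dilemma payoffs (rows/cols: cooperate, defect).
A = np.array([[3.0, 0.0],
              [4.0, 1.0]])   # row player's payoffs
B = A.T                      # symmetric game: column player's payoffs

def replicator_step(x, y, dt=0.01):
    """One Euler step of two-population replicator dynamics."""
    fx = A @ y                        # row player's payoff to each pure action
    fy = B.T @ x                      # column player's payoff to each pure action
    x = x + dt * x * (fx - x @ fx)    # actions above average payoff grow
    y = y + dt * y * (fy - y @ fy)
    return x, y

x = np.array([0.9, 0.1])   # start mostly cooperative
y = np.array([0.9, 0.1])
for _ in range(20000):
    x, y = replicator_step(x, y)
# Defection dominates the stage game, so both x and y concentrate on defect.
```

The paper's point is that across richer dynamics and repeated games with recall, a much wider (Folk Theorem-like) set of outcomes than this competitive benchmark can emerge.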
Strategic Learning: When slow and steady wins the race
(with Galit Ashkenazi-Golan, Edward Plumb, and Yufei Zhang)
Learning agents are increasingly involved in decision making. When these decisions arise in strategic interactions, the question of how to choose the learning method strategically emerges. We provide an initial step toward understanding the implications of a strategic choice of one parameter: the speed of learning in multi-agent gradient learning. We use 2x2 games to map the considerations involved in choosing the speed strategically: the effect on basins of attraction, on cyclic behaviour, and on trajectories in dominance-solvable games. For the latter, we show that, while learning as fast as possible might intuitively seem optimal, this is not always the case.
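The basin-of-attraction effect of learning speed can be sketched in a few lines. The game and parameters below are illustrative assumptions, not taken from the paper: in a 2x2 Battle of the Sexes, two projected-gradient learners start from uniform mixing, and the faster learner commits first, dragging play to its preferred coordination equilibrium.

```python
import numpy as np

def run(lr_row, lr_col, steps=2000):
    """Projected gradient ascent in an illustrative 2x2 Battle of the Sexes.

    p = P(row plays its favourite action A), q = P(col plays A).
    Assumed payoffs: (A,A) -> (2,1), (B,B) -> (1,2), miscoordination -> (0,0).
    """
    p, q = 0.5, 0.5                       # start at uniform mixing
    for _ in range(steps):
        gp = 3.0 * q - 1.0                # d/dp of row's expected payoff
        gq = 3.0 * p - 2.0                # d/dq of col's expected payoff
        p = float(np.clip(p + lr_row * gp, 0.0, 1.0))  # project onto [0, 1]
        q = float(np.clip(q + lr_col * gq, 0.0, 1.0))
    return p, q

fast_row = run(lr_row=0.1, lr_col=0.01)   # converges to (A,A), row's favourite
fast_col = run(lr_row=0.01, lr_col=0.1)   # converges to (B,B), col's favourite
```

Identical initial conditions, opposite outcomes: only the relative speeds differ, which is the kind of strategic consideration the paper maps out.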
Dormant Paper
Learning to Best Reply: On the Consistency of Multi-Agent Batch Reinforcement Learning
This paper provides asymptotic results for a class of model-free actor-critic batch reinforcement learning algorithms in the multi-agent setting. In each period, each agent faces an estimation problem (the critic, e.g. a value function) and a policy-updating problem. The estimation step is carried out by parametric function estimation based on a batch of past observations. Agents have no knowledge of each other's incentives and policies. I provide sufficient conditions for each agent's parametric function estimator to be consistent in the multi-agent environment, which enables agents to learn to best respond despite the non-stationarity inherent in multi-agent systems. These conditions depend on the environment, the batch size, and the policy step size.
These sufficient conditions are useful for the asymptotic analysis of multi-agent learning, for example in deriving long-run characterisations via stochastic approximation techniques.
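The period structure described in this abstract, a critic fitted to a batch of observations followed by a small policy step, can be sketched in its simplest possible instance. Everything below is an illustrative skeleton of my own construction (a one-state, two-action bandit with a softmax policy), not the paper's algorithm or environment.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([1.0, 2.0])   # assumed expected payoffs of two actions

theta = np.zeros(2)                 # policy logits (the "actor")
for period in range(200):
    probs = np.exp(theta) / np.exp(theta).sum()       # softmax policy
    # Collect a batch of observations under the current policy.
    actions = rng.choice(2, size=64, p=probs)
    rewards = true_means[actions] + rng.normal(scale=0.1, size=64)
    # Critic step: parametric estimation from the batch
    # (here simply per-action sample means).
    q_hat = np.array([rewards[actions == a].mean() if (actions == a).any()
                      else 0.0 for a in range(2)])
    # Actor step: small softmax policy-gradient update using the critic.
    step = 0.1                                        # policy step size
    theta += step * probs * (q_hat - probs @ q_hat)

# The policy concentrates on the higher-payoff action.
```

The paper's contribution concerns when the critic step remains consistent once several such agents learn simultaneously and the environment is therefore non-stationary, which this single-agent skeleton deliberately abstracts from.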
Zombie Prevalence and Bank Health: Exploring Feedback Effects (R&R at Management Science)
(with Andreea Rotarescu and Kevin Song)
This paper investigates feedback effects between bank health and zombie firms—financially distressed firms receiving subsidized credit. The literature focuses on how banks create zombies, overlooking zombies’ impact on bank health. Using Spanish firm-bank data (2005-2014), we document a vicious cycle: lower bank capital ratios are associated with higher zombie activity in served industries, while higher zombie prevalence is associated with reduced bank capital. We link this to a previously unexplored mechanism where banks respond appropriately to observable financial distress through higher provisioning, but overlook risks from relationship borrowers receiving subsidized rates. Our findings suggest that this feedback stems not from financial distress alone, but from the combination of distress with interest rate subsidies.