# Partially-supervised machine learning for finance

Distributional reinforcement learning based on maximum likelihood estimation

## OVERVIEW

In traditional reinforcement learning (RL), agents aim to optimise state-action choices based on recursive estimation of expected values. We found that this approach fails when the period rewards (returns) are generated by a leptokurtic law, as is common in financial applications. Under leptokurtosis, outliers are frequent and large, causing the estimates of expected values, and hence the optimal policies, to change erratically. Distributional reinforcement learning (dis-RL) improves on this because it takes the entire distribution of outcomes into account, and hence allows more efficient estimation of expected values.

We took this idea further and used the asymptotically most efficient estimator of expected values, namely the maximum likelihood estimator (MLE). In addition, since in our financial context the period reward distribution and the (asymptotic) distribution of Q (optimal state-action) values are fundamentally different, with leptokurtosis affecting the former but not the latter, we estimated their means separately. We found that the resulting distributional RL (dis-RL-mle) learns much faster, and is robust once it settles on the optimal policy.
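
The efficiency argument can be illustrated outside of any RL loop with a small simulation. The sketch below is not the dis-RL-mle implementation. It only compares two estimators of the mean of i.i.d. leptokurtic rewards: the plain sample mean and the MLE of the location parameter. Everything specific in it is an assumption made for illustration: rewards follow a Student-t law with 3 degrees of freedom, the scale and degrees of freedom are treated as known, the MLE is computed with the standard EM fixed-point iteration for the t location, and the function names (`student_t_sample`, `t_location_mle`, `compare_estimators`) are hypothetical.

```python
import random
import statistics

DF = 3.0  # assumed degrees of freedom; small values give heavy tails


def student_t_sample(rng, n, df=DF):
    """Draw n Student-t variates as normal / sqrt(chi-square / df)."""
    return [
        rng.gauss(0.0, 1.0) / (rng.gammavariate(df / 2.0, 2.0) / df) ** 0.5
        for _ in range(n)
    ]


def t_location_mle(xs, df=DF, scale=1.0, iters=100, tol=1e-10):
    """MLE of the location of a Student-t law with known df and scale.

    Uses the EM fixed-point iteration: each observation is down-weighted
    according to its distance from the current location estimate.
    """
    mu = statistics.median(xs)  # robust starting point
    for _ in range(iters):
        weights = [(df + 1.0) / (df + ((x - mu) / scale) ** 2) for x in xs]
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


def compare_estimators(n_reps=200, n_obs=300, seed=0):
    """Return the mean squared error of (sample mean, MLE); true mean is 0."""
    rng = random.Random(seed)
    mean_sq, mle_sq = 0.0, 0.0
    for _ in range(n_reps):
        xs = student_t_sample(rng, n_obs)
        mean_sq += statistics.fmean(xs) ** 2
        mle_sq += t_location_mle(xs) ** 2
    return mean_sq / n_reps, mle_sq / n_reps


if __name__ == "__main__":
    mse_mean, mse_mle = compare_estimators()
    print(f"MSE sample mean: {mse_mean:.5f}")
    print(f"MSE MLE:         {mse_mle:.5f}")
```

Under these assumptions, asymptotic theory suggests the MLE's variance is roughly half that of the sample mean, because the MLE down-weights the outliers that dominate the sample mean. A simulation with a finite number of replications will only approximate that ratio.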