Predictive Coding: A Theoretical and Experimental Review
This article summarizes the review paper "Predictive Coding: a Theoretical and Experimental Review" by Beren Millidge, Anil K. Seth, and Christopher L. Buckley (Millidge et al., 2022 - Predictive Coding: a Theoretical and Experimental Review). The review systematically covers the mathematical framework of predictive coding, its biological implementation, its main paradigms, and its relationship to modern machine-learning algorithms (ibid.; see also Bogacz, 2017 - A tutorial on the free-energy framework for modelling perception and learning; Buckley et al., 2017 - The free energy principle for action and perception: A mathematical review).
Citation convention: key sources are given as "Author, Year - Original Title". Where a subsection corresponds mainly to a specific section of Millidge et al. (2022), this is noted at the start of the subsection ("Corresponding review section").
1. Introduction
Predictive coding proposes a unified account of cortical function: the brain's core operation is to minimize prediction error, the discrepancy between predicted and actually received inputs (Clark, 2013 - Whatever next? predictive brains, situated agents, and the future of cognitive science; Friston, 2005 - A theory of cortical responses; Rao & Ballard, 1999 - Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects; Millidge et al., 2022 - Predictive Coding: a Theoretical and Experimental Review). This minimization can be achieved in three ways:
1. Perception: inferring the hidden states of the world that explain sensory input (Friston, 2005 - A theory of cortical responses; Beal, 2003 - Variational algorithms for approximate Bayesian inference).
2. Learning: updating the world model so that it makes better predictions (Friston, 2003 - Learning and inference in the brain; Neal & Hinton, 1998 - A view of the EM algorithm that justifies incremental, sparse, and other variants).
3. Action: acting on the world to sample inputs that conform to predictions, i.e. active inference (Friston et al., 2009 - Reinforcement learning or active inference?; Friston, 2010 - The free-energy principle: a unified brain theory?).
The ideas behind predictive coding have a long history; its precursors include Helmholtz's notion of perception as unconscious inference and Barlow's minimum redundancy principle (Helmholtz, 1866 - Concerning the perceptions in general; Barlow, 1961 - The coding of sensory messages). Mumford (1992) generalized the intuition of "template matching plus transmitting only the residuals" into an early theory at the level of cortical architecture (Mumford, 1992 - On the computational architecture of the neocortex. II. The role of cortico-cortical loops). Rao & Ballard's modelling of concrete phenomena such as endstopping and other extra-classical receptive-field effects in visual cortex then gave the idea a computable, testable form (Rao & Ballard, 1999 - Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects).
2. The Mathematical Framework of Predictive Coding
2.1 Variational Inference
(Corresponding review section: Millidge et al., 2022 - Predictive Coding: a Theoretical and Experimental Review, §2.1 "Predictive Coding as Variational Inference") Friston (2003, 2005, 2008) formalized predictive coding as approximate Bayesian inference under a Gaussian generative model ("predictive coding as variational inference"), connecting the Rao & Ballard-style energy function to the variational free energy (Friston, 2003 - Learning and inference in the brain; Friston, 2005 - A theory of cortical responses; Friston, 2008 - Hierarchical models in the brain; Rao & Ballard, 1999 - Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects). For the machine-learning/statistics background on variational inference, see the classic treatments and tutorials (Beal, 2003 - Variational algorithms for approximate Bayesian inference; Jordan et al., 1999 - An introduction to variational methods for graphical models; Blei et al., 2017 - Variational inference: A review for statisticians).
* Goal: infer latent states $x$ that explain observations $o$; that is, compute the posterior $p(x|o)$.
* Method: introduce an approximate posterior $q(x|o;\phi)$ and bring it close to the true posterior by minimizing the variational free energy $\mathcal{F}$ (Friston, 2005 - A theory of cortical responses; Blei et al., 2017 - Variational inference: A review for statisticians).
Our goal is to minimize the KL divergence: $$q^*(x|o;\phi) = \underset{\phi}{\operatorname{argmin}} \ \mathbb{D}[q(x|o;\phi)||p(x|o)] \tag{1}$$
Since the true posterior is intractable, we instead minimize the variational free energy $\mathcal{F}$, an upper bound on the negative log evidence $-\ln p(o)$ that differs from the KL divergence in Eq. (1) only by a term constant in $\phi$: $$\mathcal{F} = \underbrace{-\mathbb{E}_{q(x|o;\phi)}[\ln p(o,x;\theta)]}_{\text{Energy}} - \underbrace{\left(-\mathbb{E}_{q(x|o;\phi)}[\ln q(x|o;\phi)]\right)}_{\text{Entropy}} \tag{4}$$
Under the Gaussian generative model $p(o, x; \theta) = \mathcal{N}(o; f(\theta_1 x), \Sigma_1)\,\mathcal{N}(x; g(\theta_2 \bar{\mu}), \Sigma_2)$, and with a point-mass (Dirac delta) approximate posterior centred at $\mu$, the free energy reduces to a precision-weighted sum of squared prediction errors: $$\mathcal{F} = \frac{1}{2}\left[\epsilon_o^T\Sigma_{1}^{-1}\epsilon_o + \epsilon_x^T\Sigma_{2}^{-1}\epsilon_x + \ln\det(2\pi\Sigma_{1}) + \ln\det(2\pi\Sigma_{2})\right] \tag{5}$$ where the prediction errors are defined as $\epsilon_o = o - f(\theta_1 \mu)$ and $\epsilon_x = \mu - g(\theta_2 \bar{\mu})$.
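As a concrete illustration, the following sketch evaluates Eq. (5) for a single-layer Gaussian model. The dimensions, random weights, and the choice of tanh for $f$ and $g$ are toy assumptions, not code from the review:

```python
import numpy as np

# Minimal sketch of Eq. (5): Gaussian free energy as precision-weighted
# squared prediction errors plus the log-determinant normalizers.
rng = np.random.default_rng(0)

def free_energy(o, mu, mu_bar, theta1, theta2, Sigma1, Sigma2, f=np.tanh, g=np.tanh):
    eps_o = o - f(theta1 @ mu)        # sensory prediction error
    eps_x = mu - g(theta2 @ mu_bar)   # prior prediction error
    P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    return 0.5 * (eps_o @ P1 @ eps_o + eps_x @ P2 @ eps_x
                  + np.log(np.linalg.det(2 * np.pi * Sigma1))
                  + np.log(np.linalg.det(2 * np.pi * Sigma2)))

d_o, d_x = 4, 3                       # toy observation / latent dimensions
theta1 = rng.normal(size=(d_o, d_x))
theta2 = rng.normal(size=(d_x, d_x))
Sigma1, Sigma2 = np.eye(d_o), np.eye(d_x)
o = rng.normal(size=d_o)
mu = rng.normal(size=d_x)
mu_bar = np.zeros(d_x)
F = free_energy(o, mu, mu_bar, theta1, theta2, Sigma1, Sigma2)
```

With identity covariances the normalizers reduce to $(d_o + d_x)\ln 2\pi$, so perception amounts to adjusting $\mu$ to shrink the two squared-error terms.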
2.2 Multi-layer Predictive Coding
(Corresponding review section: Millidge et al., 2022, §2.2 "Multi-layer Predictive Coding") Multi-layer (hierarchical) predictive coding generalizes the single-layer generative model to a deep hierarchy: a hierarchical generative model with local prediction-error/free-energy minimization at every layer, developed systematically in Friston's hierarchical-model and free-energy work (Friston, 2005 - A theory of cortical responses; Friston, 2008 - Hierarchical models in the brain). In the visual-cortex modelling tradition, the intuition of hierarchical structure with top-down predictions and bottom-up residuals also traces back to (Rao & Ballard, 1999 - Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects; Mumford, 1992 - On the computational architecture of the neocortex. II. The role of cortico-cortical loops). For an $L$-layer model: $$p(x_0 \dots x_L) = p(x_L) \prod_{l=0}^{L-1} p(x_l | x_{l+1}) \tag{10}$$
The total free energy is the sum of the per-layer free energies: $$\mathcal{F} = \frac{1}{2}\sum_{l=0}^{L-1}\left[\epsilon_l^T \Sigma_l^{-1} \epsilon_l + \ln\det(2\pi\Sigma_l)\right] \tag{11}$$ where $\epsilon_l = \mu_l - f_l(\theta_{l+1}\mu_{l+1})$ and the bottom layer $\mu_0 = o$ is clamped to the observation.
Gradient descent on $\mathcal{F}$ yields the update rules for the value neurons $\mu$ and the weights $\theta$: $$\frac{d\mu_l}{dt} = -\frac{\partial \mathcal{F}}{\partial \mu_l} = -\Sigma_l^{-1} \epsilon_l + \Sigma_{l-1}^{-1} \epsilon_{l-1} \frac{\partial f_{l-1}}{\partial \mu_l} \tag{12}$$ $$\frac{d\theta_l}{dt} = -\frac{\partial \mathcal{F}}{\partial \theta_l} = \Sigma_{l-1}^{-1} \epsilon_{l-1} \frac{\partial f_{l-1}}{\partial \theta_l} \tag{13}$$ The update of $\mu_l$ is thus driven jointly by the bottom-up prediction error (driving term) and the layer's own top-down-predicted error (prior term), while the weight update, an outer product of a layer's error and activity, takes a local Hebbian form (Friston, 2005 - A theory of cortical responses; Millidge et al., 2022 - Predictive Coding: a Theoretical and Experimental Review).
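The inference dynamics of Eq. (12) can be sketched for a small *linear* hierarchy (identity $f$, unit precisions; the layer sizes and weights below are toy assumptions). Running the updates lowers the free energy as the value neurons settle:

```python
import numpy as np

# Toy 3-layer linear predictive coding network. mu[0] is clamped to the
# observation; the hidden and top layers relax by gradient descent on F.
rng = np.random.default_rng(1)
sizes = [8, 6, 4]                                # layer widths, bottom to top
theta = [rng.normal(size=(sizes[l], sizes[l + 1])) * 0.1 for l in range(2)]

o = rng.normal(size=sizes[0])
mu = [o] + [np.zeros(s) for s in sizes[1:]]

def errors(mu):
    # eps[l] = mu_l - theta_l mu_{l+1}: each layer's top-down prediction error
    return [mu[l] - theta[l] @ mu[l + 1] for l in range(2)]

def free_energy(mu):
    return 0.5 * sum(e @ e for e in errors(mu))

lr = 0.1
F0 = free_energy(mu)
for _ in range(200):
    eps = errors(mu)
    # Eq. (12): bottom-up error drives the layer; its own error pulls it back.
    mu[1] = mu[1] + lr * (theta[0].T @ eps[0] - eps[1])
    mu[2] = mu[2] + lr * (theta[1].T @ eps[1])   # top layer has no eps above
F1 = free_energy(mu)                             # lower than F0 after settling
```

Because the problem is a convex quadratic, these simultaneous gradient steps strictly decrease $\mathcal{F}$ for a small enough step size.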
2.3 Dynamical Predictive Coding and Generalized Coordinates
(Corresponding review section: Millidge et al., 2022, §2.3 "Dynamical Predictive Coding and Generalized Coordinates") To handle time series, Friston introduced generalized coordinates of motion $\tilde{\mu} = [\mu, \mu', \mu'', \dots]$ into hierarchical dynamical models, and unified free-energy minimization over sequences via the free action (Friston, 2008 - Hierarchical models in the brain; Millidge et al., 2022 - Predictive Coding: a Theoretical and Experimental Review). The system minimizes the free action $\bar{\mathcal{F}}$: $$\bar{\mathcal{F}} = \int dt\, \mathcal{F}_t \tag{14}$$
The optimum satisfies the following differential equation: $$\dot{\tilde{\mu}} = \mathcal{D}\tilde{\mu} - \frac{\partial \mathcal{F}}{\partial \tilde{\mu}} = \mathcal{D}\tilde{\mu} + \frac{\partial \tilde{f}}{\partial \tilde{\mu}}^{T}\tilde{\Sigma}_o^{-1}\tilde{\epsilon}_o - \tilde{\Sigma}_x^{-1}\tilde{\epsilon}_x \tag{19}$$ where $\mathcal{D}$ is the block-derivative operator that shifts each order of motion up by one ($\mathcal{D}\tilde{\mu} = [\mu', \mu'', \dots]$). The state update therefore combines the error terms with a motion term $\mathcal{D}\tilde{\mu}$, allowing the estimate to anticipate a moving trajectory rather than settle on a static state.
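A minimal illustration of what the $\mathcal{D}\tilde{\mu}$ term buys, under toy assumptions (a ramp stimulus whose derivative is also observable, hand-picked gains): an order-zero estimator tracks a moving signal with a constant lag, while a two-order generalized estimator removes the lag.

```python
import numpy as np

# Track o(t) = t (with observed derivative o'(t) = 1) two ways:
#  - mu_static: plain gradient descent, d(mu)/dt = k*(o - mu)
#  - (mu, mu_d): generalized coordinates, where the D-operator shifts mu'
#    into mu's update, so the estimate anticipates the motion.
dt, k = 0.001, 2.0
mu_static = 0.0
mu, mu_d = 0.0, 0.0

for step in range(20000):
    t = step * dt
    o, o_d = t, 1.0                       # ramp observation and its derivative
    mu_static += dt * k * (o - mu_static)
    mu += dt * (mu_d + k * (o - mu))      # D mu~ term: mu' feeds mu's update
    mu_d += dt * k * (o_d - mu_d)

t_final = 20000 * dt                      # = 20.0
lag_static = abs(t_final - mu_static)     # settles near 1/k = 0.5
lag_gen = abs(t_final - mu)               # shrinks toward zero
```

The order-zero estimator converges to a fixed tracking error of about $1/k$, whereas the generalized estimator, once $\mu'$ has converged to the true velocity, tracks the ramp with vanishing lag.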
2.4 Precision
(Corresponding review section: Millidge et al., 2022, §2.4 "Predictive Coding and Precision") The precision $\Sigma^{-1}$ is the inverse of the covariance. It can be learned as part of free-energy descent and is commonly interpreted as a weighting on prediction errors, linked at the neural level to attention and uncertainty modulation (Feldman & Friston, 2010 - Attention, uncertainty, and free-energy; Kanai et al., 2015 - Cerebral hierarchies: predictive processing, precision and the pulvinar; Millidge et al., 2022). $$\Sigma_{l} = \mathbb{V}\left[\tilde{\epsilon}_l\right] = \mathbb{E}\left[\tilde{\epsilon}_{l}\tilde{\epsilon}_{l}^{T}\right] \tag{22}$$ That is, the optimal covariance is the covariance of the prediction errors, and the optimal precision is its inverse. In neurobiological terms, this modulation of precision is thought to correspond to attention.
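Eq. (22) can be illustrated directly: the optimal covariance is just the empirical second moment of the errors, and the precision its inverse. The error distribution below is a toy assumption:

```python
import numpy as np

# Sketch of Eq. (22): estimate the optimal covariance (hence precision)
# as the empirical covariance of simulated prediction errors.
rng = np.random.default_rng(2)
true_Sigma = np.diag([1.0, 4.0])               # ground-truth error covariance
chol = np.linalg.cholesky(true_Sigma)
eps = (chol @ rng.normal(size=(2, 5000))).T    # 5000 sampled error vectors

Sigma_hat = eps.T @ eps / len(eps)             # Sigma_l = E[eps eps^T]
Pi_hat = np.linalg.inv(Sigma_hat)              # precision = inverse covariance
```

With enough samples `Sigma_hat` recovers `true_Sigma`, so a network that tracks its own error statistics can learn how strongly to weight each error channel.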
2.5 Predictive Coding in the Brain
(Corresponding review section: Millidge et al., 2022, §2.5 "Predictive Coding in the Brain?") Bastos et al. (2012) proposed the widely cited canonical microcircuit hypothesis, which maps the division of labour between predictions and prediction errors onto distinct cortical laminae (especially superficial vs. deep layers) and links feedforward vs. feedback signalling to distinct frequency bands (Bastos et al., 2012 - Canonical microcircuits for predictive coding; Bastos et al., 2015 - Visual areas exert feedforward and feedback influences through distinct frequency channels; see also Shipp, 2016 - Neural elements for predictive coding).
* Superficial layers (L2/3): contain prediction-error neurons (propagating $\epsilon_l$ upward) and value neurons $\mu_l$.
* Deep layers (L5/6): contain prediction neurons, which send feedback predictions $f(\mu_{l+1})$ to the layer below.
* This model accounts for the observed frequency coupling between cortical hierarchical levels (Bastos et al., 2015 - Visual areas exert feedforward and feedback influences through distinct frequency channels).
3. Paradigms of Predictive Coding
3.1 Unsupervised Predictive Coding
(Corresponding review section: Millidge et al., 2022, §3.1 "Unsupervised predictive coding")
* Autoencoding/reconstructive predictive coding: the objective is to reconstruct the input $o$; the classic example is Rao & Ballard's hierarchical prediction-error network and its account of receptive fields and extra-classical effects (Rao & Ballard, 1999 - Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects; related to the sparse-coding line of receptive-field learning: Olshausen & Field, 1996 - Emergence of simple-cell receptive field properties by learning a sparse code for natural images).
* Temporal predictive coding (video/sequences): the objective is to predict the input at time $t+1$ (Lotter et al., 2016 - Deep predictive coding networks for video prediction and unsupervised learning; review context: Millidge et al., 2022).
* Spatial/representational (contrastive) predictive coding: the objective is to predict future or contextual representations, and is often linked to contrastive learning (Oord et al., 2018 - Representation learning with contrastive predictive coding; review context: Millidge et al., 2022).
3.2 Supervised Predictive Coding
(Corresponding review section: Millidge et al., 2022, §3.2 "Supervised predictive coding: Forwards and Backwards")
- Forwards (generative) mode: the top layer is clamped to the label and the bottom layer to the image. Generating images is easy, but classification requires iterative inference.
- Backwards (discriminative) mode: the bottom layer is clamped to the label and the top layer to the image. Classification then resembles a single forward pass and is fast.
The two modes can be understood as imposing different clamping/boundary conditions on the same generative model, trading off inference cost against expressive power between generative and discriminative tasks (Millidge et al., 2022; compare the hierarchical generative-model formulation of Friston, 2005 - A theory of cortical responses).
3.3 Relaxed Predictive Coding
Millidge et al. (2020 - Relaxing the constraints on predictive coding models) propose relaxations that address the weight-transport problem, using learnable feedback weights $\psi$: $$\frac{d\psi_l}{dt} = \mu_{l+1} \frac{\partial f}{\partial \psi} \epsilon_l^T \tag{23}$$ together with a linearized update rule that drops the nonlinear derivatives: $$\frac{d\mu_l}{dt} = \Sigma_{l-1}^{-1} \epsilon_{l-1} \theta_l^T - \Sigma_l^{-1} \epsilon_l \tag{24}$$
4. Relationship to Other Algorithms
4.1 Backpropagation
(Corresponding review section: Millidge et al., 2022, §4.1 "Predictive Coding and Backpropagation of error") Whittington & Bogacz (2017) and Millidge et al. (2020) showed that, under conditions such as the fixed prediction assumption, the equilibria of predictive coding satisfy an error recursion identical to the chain rule of backpropagation (Whittington & Bogacz, 2017 - An approximation of the error backpropagation algorithm in a predictive coding network; Millidge et al., 2020 - Predictive coding approximates backpropagation on arbitrary computation graphs): $$\epsilon_i^* = \sum_{j \in \mathcal{C}(v_i)} \epsilon_j^* \frac{\partial \hat{v}_j}{\partial v_i} \tag{32}$$ In other words, at equilibrium the prediction-error neurons carry exactly the gradients that backpropagation would compute.
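The fixed-point relation in Eq. (32) can be checked numerically. Below is a minimal sketch (toy weights and data, not code from the cited papers) of a tanh network in which the input and output are clamped, predictions are held at their feedforward values (the fixed prediction assumption), and the hidden activities relax by gradient descent; the equilibrium errors then coincide with backpropagation's layer-wise error signals:

```python
import numpy as np

# Toy 3-weight-layer tanh network, pre-activation nodes a1, a2.
rng = np.random.default_rng(3)
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))
x = rng.normal(size=4)
t = rng.normal(size=3)

# Feedforward pass and backprop "deltas" (negative pre-activation gradients
# of the squared loss 0.5*||t - y||^2).
a1 = W1 @ x
a2 = W2 @ np.tanh(a1)
y = W3 @ np.tanh(a2)
d3 = t - y                                  # output error (output clamped to t)
d2 = (1 - np.tanh(a2)**2) * (W3.T @ d3)     # chain rule, one layer down
d1 = (1 - np.tanh(a1)**2) * (W2.T @ d2)

# Predictive coding inference under the fixed prediction assumption:
# predictions stay at the feedforward values a1, a2; hidden nodes relax.
v1, v2 = a1.copy(), a2.copy()
for _ in range(500):
    e1 = v1 - a1                            # errors against fixed predictions
    e2 = v2 - a2
    v1 += 0.1 * (-e1 + (1 - np.tanh(a1)**2) * (W2.T @ e2))
    v2 += 0.1 * (-e2 + (1 - np.tanh(a2)**2) * (W3.T @ d3))
# At equilibrium, e1 = v1 - a1 and e2 = v2 - a2 match d1 and d2.
```

The relaxation is a linear contraction here, so a few hundred steps suffice for the errors to match the backprop deltas to machine precision.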
4.2 Kalman Filtering
(Corresponding review section: Millidge et al., 2022, §4.2 "Linear Predictive Coding and Kalman Filtering") Under a linear Gaussian state-space model, $$x_{t+1} = Ax_t + Bu_t + \omega \tag{33}$$ the predictive coding updates can be mapped onto the correction step of the Kalman filter (Kalman, 1960 - A new approach to linear filtering and prediction problems; Millidge et al., 2021 - Neural Kalman filtering; review: Millidge et al., 2022). For example, gradient descent on the posterior mean takes the form $$\frac{dL}{d\mu_{t+1}} = -C^T \Sigma_2^{-1} \epsilon_o + \Sigma_1^{-1} \epsilon_x \tag{41}$$ and iterating this update to convergence implements the optimal filter.
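The correspondence can be checked on a toy example (all matrices and values below are illustrative assumptions): gradient descent on the precision-weighted errors of Eq. (41) converges to the Kalman correction of the posterior mean.

```python
import numpy as np

# Toy one-step filtering problem: predict the state, observe one component,
# then correct the estimate.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state transition
C = np.array([[1.0, 0.0]])               # observe first component only
Sigma1 = 0.1 * np.eye(2)                 # process noise covariance
Sigma2 = np.array([[0.5]])               # observation noise covariance

mu_prev = np.array([0.0, 1.0])
x_pred = A @ mu_prev                     # prior prediction of the next state
o = np.array([0.7])                      # observation

# Kalman correction (prior covariance taken as Sigma1 for this single step).
K = Sigma1 @ C.T @ np.linalg.inv(C @ Sigma1 @ C.T + Sigma2)
mu_kalman = x_pred + K @ (o - C @ x_pred)

# Predictive coding: iterate gradient descent on the weighted squared errors.
mu = x_pred.copy()
P1, P2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
for _ in range(2000):
    eps_o = o - C @ mu                   # observation prediction error
    eps_x = mu - x_pred                  # dynamics prediction error
    mu += 0.01 * (C.T @ P2 @ eps_o - P1 @ eps_x)
# mu now matches mu_kalman.
```

Both computations minimize the same quadratic objective, so the iterative predictive coding estimate lands on the closed-form Kalman update.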
4.3 Normalizing Flows
(Corresponding review section: Millidge et al., 2022, §4.3 "Predictive Coding, Normalization, and Normalizing Flows") Marino (2020) points out that the layer-wise transformation $o = f(\mu)$ in predictive coding can be put in correspondence with generative-model frameworks such as normalizing flows and variational autoencoders (Marino, 2020 - Predictive coding, variational autoencoders, and biological connections; Kingma & Welling, 2013 - Auto-Encoding Variational Bayes; Rezende & Mohamed, 2015 - Variational inference with normalizing flows). For invertible $f$, the change-of-variables formula relates the two densities: $$p(\mu) = p(o) \left| \det \frac{\partial f}{\partial \mu} \right| \tag{47}$$
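A one-dimensional sanity check of Eq. (47), using the toy choice $f(\mu) = e^{\mu}$: if $\mu \sim \mathcal{N}(0,1)$, then $o = f(\mu)$ is log-normal, and multiplying the log-normal density by $|df/d\mu|$ recovers the standard normal density.

```python
import numpy as np

# Change of variables: p(mu) = p(o) * |df/dmu| with o = exp(mu).
def normal_pdf(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def lognormal_pdf(o):                     # density of exp(N(0, 1))
    return np.exp(-np.log(o)**2 / 2) / (o * np.sqrt(2 * np.pi))

mu = np.linspace(-2, 2, 9)
o = np.exp(mu)                            # o = f(mu)
recovered = lognormal_pdf(o) * o          # p(o) * |df/dmu|, since df/dmu = exp(mu) = o
```

The Jacobian factor is exactly what a flow-style generative model tracks layer by layer.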
4.4 Biased Competition
Spratling (2008 - Reconciling predictive coding and biased competition models...) showed that, under linear assumptions, predictive coding is mathematically equivalent to biased competition models. The discrete-time predictive coding update $$\mu_{t+1} = (1 - \eta) \mu_t + \eta \epsilon_o \theta_1^T + \eta \theta_2 \bar{\mu}_t \tag{50}$$ has the same form as the biased competition model.
4.5 Active Inference and PID Control
(Corresponding review section: Millidge et al., 2022, §4.5 "Predictive Coding and Active Inference") Active inference selects actions $a$ by minimizing free energy (Friston et al., 2009 - Reinforcement learning or active inference?; Friston, 2010 - The free-energy principle: a unified brain theory?): $$\frac{da}{dt} = -\frac{\partial \mathcal{F}}{\partial a} = -\frac{\partial o(a)}{\partial a}^{T} \Sigma_o^{-1} \epsilon_o \tag{51}$$
Baltieri & Buckley (2019 - Pid control as a process of active inference...) showed that linear active inference in generalized coordinates is equivalent to a PID controller: $$\frac{da}{dt} = -\sigma_z^{-1}(o - \bar{\mu}) - \sigma_{z'}^{-1}(o' - \bar{\mu}') - \sigma_{z''}^{-1}(o'' - \bar{\mu}'') \tag{58}$$ Because the action $a$ is obtained by integrating $da/dt$ over time, the zeroth-, first-, and second-order error terms give rise to the integral, proportional, and derivative components of a PID controller, respectively.
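As an illustration of how integral and proportional action emerge, here is a toy sketch that keeps only the zeroth- and first-order terms of Eq. (58) (i.e. a PI controller) and applies them to a hypothetical first-order plant with a constant disturbance; the integrated zeroth-order error cancels the disturbance, removing the steady-state error:

```python
import numpy as np

# Toy plant: dx/dt = -x + a + disturbance. The agent "expects" to observe
# o = target and acts to make that prediction come true.
dt = 0.01
k0, k1 = 2.0, 1.0          # inverse variances sigma_z^-1, sigma_z'^-1 (gains)
target = 1.0               # prior mean mu_bar: "I expect to observe o = 1"
x, a = 0.0, 0.0            # plant state and action
disturbance = -0.5         # constant load the integral action must cancel

for _ in range(5000):
    o = x                                 # observation
    o_dot = -x + a + disturbance          # observed rate of change
    # da/dt = -k0*(o - mu_bar) - k1*(o' - mu_bar'), with mu_bar' = 0
    a += dt * (-k0 * (o - target) - k1 * o_dot)
    x += dt * (-x + a + disturbance)      # first-order plant dynamics
# x settles at the target; a settles at target - disturbance.
```

Because `a` accumulates the zeroth-order error over time, it behaves as the integral term of a PI controller, while the first-order error supplies proportional-like damping.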
5. Discussion and Future Directions
Although predictive coding theory has developed rapidly, challenges remain:
1. Neural implementation: explaining the complex cortical connectivity required, including the implementation of inhibitory feedback.
2. Generative models: exploring non-Gaussian and discrete generative models.
3. Spiking neural networks: extending the theory to spiking networks (Boerlin et al., 2013 - Predictive coding of dynamical variables in balanced spiking networks).
4. Continuous time and credit assignment: resolving the biological implausibility of backpropagation through time (BPTT).
Note: this literature summary is compiled from the paper by Millidge, Seth, & Buckley.
References (key entries, with original titles)
- Millidge, B., Seth, A. K., & Buckley, C. L. (2022). Predictive Coding: a Theoretical and Experimental Review.
- Bogacz, R. (2017). A tutorial on the free-energy framework for modelling perception and learning.
- Buckley, C. L., Kim, C. S., McGregor, S., & Seth, A. K. (2017). The free energy principle for action and perception: A mathematical review.
- Clark, A. (2013). Whatever next? predictive brains, situated agents, and the future of cognitive science.
- Friston, K. (2003). Learning and inference in the brain.
- Friston, K. (2005). A theory of cortical responses.
- Friston, K. (2008). Hierarchical models in the brain.
- Friston, K. (2010). The free-energy principle: a unified brain theory?.
- Friston, K. J., Daunizeau, J., & Kiebel, S. J. (2009). Reinforcement learning or active inference?.
- Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects.
- Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-cortical loops.
- Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference.
- Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models.
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians.
- Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants.
- Feldman, H., & Friston, K. (2010). Attention, uncertainty, and free-energy.
- Kanai, R., Komura, Y., Shipp, S., & Friston, K. (2015). Cerebral hierarchies: predictive processing, precision and the pulvinar.
- Bastos, A. M., Usrey, W. M., Adams, R. A., Mangun, G. R., Fries, P., & Friston, K. J. (2012). Canonical microcircuits for predictive coding.
- Bastos, A. M., Vezoli, J., Bosman, C. A., Schoffelen, J.-M., Oostenveld, R., Dowdall, J. R., et al. (2015). Visual areas exert feedforward and feedback influences through distinct frequency channels.
- Shipp, S. (2016). Neural elements for predictive coding.
- Whittington, J. C. R., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network.
- Millidge, B., Tschantz, A., & Buckley, C. L. (2020). Predictive coding approximates backpropagation on arbitrary computation graphs.
- Millidge, B., Tschantz, A., & Buckley, C. L. (2020). Relaxing the constraints on predictive coding models.
- Kalman, R. E. (1960). A new approach to linear filtering and prediction problems.
- Millidge, B., et al. (2021). Neural Kalman filtering.
- Marino, J. (2020). Predictive coding, variational autoencoders, and biological connections.
- Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes.
- Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows.
- Spratling, M. W. (2008). Reconciling predictive coding and biased competition models: a unified model of cortical computation.
- Baltieri, M., & Buckley, C. L. (2019). PID control as a process of active inference with linear generative models.
- Lotter, W., Kreiman, G., & Cox, D. (2016). Deep predictive coding networks for video prediction and unsupervised learning.
- Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding.
- Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
- Barlow, H. B. (1961). The coding of sensory messages.
- Helmholtz, H. v. (1866). Concerning the perceptions in general.