Graduate School of Science and Engineering, Yamagata University, Yonezawa, Yamagata 992-8510, Japan
Received December 4, 2015; Accepted January 8, 2016; Published February 9, 2016
A Gaussian restricted Boltzmann machine (GRBM) is a Boltzmann machine defined on a bipartite graph and is an extension of the usual restricted Boltzmann machine. A GRBM consists of two different layers: a visible layer composed of continuous visible variables and a hidden layer composed of discrete hidden variables. In this paper, we derive two different inference algorithms for GRBMs based on the naïve mean-field approximation (NMFA). One is an inference algorithm for the whole set of variables in a GRBM, and the other is an inference algorithm for a partial set of variables in a GRBM. We compare the two methods analytically and numerically and show that the latter method is better.
A restricted Boltzmann machine (RBM) is a statistical machine learning model that is defined on a bipartite graph1,2) and forms a fundamental component of deep learning.3,4) The increasing use of deep learning techniques in various fields is leading to a growing demand for the analysis of computational algorithms for RBMs. The computational procedure for RBMs is divided into two main stages: the learning stage and the inference stage. We train an RBM using an observed data set in the learning stage, and we compute some statistical quantities, e.g., expectations of variables, for the trained RBM in the inference stage. For the learning stage, many efficient algorithms, e.g., contrastive divergence,2) have been developed. On the other hand, algorithms for the inference stage have seen comparatively little development. Methods based on Gibbs sampling and the naïve mean-field approximation (NMFA) are mainly used for the inference stage, for example in Refs. 5 and 6. However, some new algorithms7–9) based on advanced mean-field methods10,11) have emerged in recent years.
In this paper, we focus on a model referred to as the Gaussian restricted Boltzmann machine (GRBM), which is a slightly extended version of the Gaussian–Bernoulli restricted Boltzmann machine (GBRBM).4,12,13) A GBRBM enables us to treat continuous data and is a fundamental component of a Gaussian–Bernoulli deep Boltzmann machine.14) In GBRBMs, the hidden variables are binary, whereas in GRBMs they can take arbitrary discrete values. A statistical mechanical analysis of GBRBMs was presented in Ref. 15. For GRBMs, we study inference algorithms based on the NMFA. Since the NMFA is one of the most important foundations of advanced mean-field methods, a deeper understanding of the NMFA for RBMs will provide important insights into subsequent inference algorithms based on the advanced mean-field methods. For GRBMs, it is possible to obtain two different types of NMFAs: the NMFA for the whole system and the NMFA for a marginalized system. First, we derive the two approximations and then compare them analytically and numerically. Finally, we show that the latter approximation is better.
The remainder of this paper is organized as follows. The definition of GRBMs is presented in Sect. 2. The two different types of NMFAs are formulated in Sects. 3.1 and 3.2. Then, we compare the two methods analytically in Sect. 4.1 and numerically in Sect. 4.2, and we show that the NMFA for a marginalized system is better. Finally, Sect. 5 concludes the paper.
2. Gaussian Restricted Boltzmann Machine
Let us consider a bipartite graph consisting of two different layers: a visible layer and a hidden layer. The continuous visible variables, \(\boldsymbol{{v}}= \{v_{i}\in (-\infty,\infty)\mid i\in V\}\), are assigned to the vertices in the visible layer and the discrete hidden variables with a sample space \(\mathcal{X}\), \(\boldsymbol{{h}}= \{h_{j}\in\mathcal{X}\mid j\in H\}\), are assigned to the vertices in the hidden layer, where V and H are the sets of vertices in the visible and the hidden layers, respectively. Figure 1 shows the bipartite graph. On the graph, we define the energy function as \begin{equation} E(\boldsymbol{{v}},\boldsymbol{{h}}; \theta):=\frac{1}{2}\sum_{i \in V}\frac{(v_{i} - b_{i})^{2}}{\sigma_{i}^{2}}-\sum_{i \in V}\sum_{j \in H}\frac{w_{ij}}{\sigma_{i}^{2}}v_{i} h_{j}-\sum_{j \in H}c_{j}h_{j}, \end{equation} (1) where \(b_{i}\), \(\sigma_{i}^{2}\), \(c_{j}\), and \(w_{ij}\) are the parameters of the energy function and they are collectively denoted by θ. Specifically, \(b_{i}\) and \(c_{j}\) are the biases for the visible and the hidden variables, respectively, \(w_{ij}\) are the couplings between the visible and the hidden variables, and \(\sigma_{i}^{2}\) are the parameters related to the variances of the visible variables. The GRBM is defined by \begin{equation} P(\boldsymbol{{v}},\boldsymbol{{h}}\mid \theta):= \frac{1}{Z(\theta)}\exp[-E(\boldsymbol{{v}},\boldsymbol{{h}};\theta)] \end{equation} (2) in terms of the energy function in Eq. (1). Here, \(Z(\theta)\) is the partition function defined by \begin{equation*} Z(\theta):=\int \sum_{\boldsymbol{{h}}}\exp[-E(\boldsymbol{{v}},\boldsymbol{{h}};\theta)]\,d\boldsymbol{{v}}, \end{equation*} where \(\int(\cdots)\,d\boldsymbol{{v}}=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}(\cdots)\,dv_{1}\,dv_{2}\cdots dv_{|V|}\) is the multiple integration over all the possible realizations of the visible variables and \(\sum_{\boldsymbol{{h}}} =\sum_{h_{1}\in\mathcal{X}}\sum_{h_{2}\in\mathcal{X}}\cdots\sum_{h_{|H|}\in \mathcal{X}}\) is the multiple summation over those of the hidden variables. When \(\mathcal{X} = \{+1,-1\}\), the GRBM corresponds to a GBRBM.13)
Figure 1. Bipartite graph consisting of two layers: the visible layer V and the hidden layer H.
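For illustration, the energy function in Eq. (1) can be implemented directly. The following sketch assumes the parameters are stored as NumPy arrays; the array shapes and the function name are our own choices and are not part of the original formulation.

```python
import numpy as np

def energy(v, h, b, c, w, sigma2):
    """Energy E(v, h; theta) of Eq. (1).

    v      : (|V|,) visible configuration (continuous)
    h      : (|H|,) hidden configuration (values in the sample space X)
    b, c   : visible and hidden biases
    w      : (|V|, |H|) couplings
    sigma2 : (|V|,) variance parameters of the visible variables
    """
    quadratic = 0.5 * np.sum((v - b) ** 2 / sigma2)            # first term of Eq. (1)
    coupling = np.sum((v / sigma2)[:, None] * w * h[None, :])  # second term
    hidden_bias = np.sum(c * h)                                # third term
    return quadratic - coupling - hidden_bias
```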
The distribution of the visible variables conditioned with the hidden variables is \begin{equation} P(\boldsymbol{{v}}\mid \boldsymbol{{h}},\theta) = \prod_{i\in V}\mathcal{N}(v_{i}\mid \mu_{i}(\boldsymbol{{h}}),\sigma_{i}^{2}), \end{equation} (3) where \(\mathcal{N}(x\mid\mu,\sigma^{2})\) is the Gaussian over \(x\in (-\infty,\infty)\) with mean μ and variance \(\sigma^{2}\) and \begin{equation} \mu_{i}(\boldsymbol{{h}}) := b_{i} + \sum_{j\in H}w_{ij}h_{j}. \end{equation} (4) On the other hand, the distribution of the hidden variables conditioned with the visible variables is \begin{equation} P(\boldsymbol{{h}}\mid \boldsymbol{{v}},\theta) =\prod_{j \in H}\frac{\exp[\lambda_{j}(\boldsymbol{{v}})h_{j}]}{\displaystyle\sum_{h \in \mathcal{X}} \exp[\lambda_{j}(\boldsymbol{{v}})h]}, \end{equation} (5) where \begin{equation} \lambda_{j}(\boldsymbol{{v}}):=c_{j} + \sum_{i\in V}\frac{w_{ij}}{\sigma_{i}^{2}}v_{i}. \end{equation} (6) From Eqs. (3) and (5), it is ensured that if one layer is conditioned, the variables in the other are statistically independent of each other. This property is referred to as conditional independence.
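Because of this conditional independence, each layer can be sampled in one block given the other, as in blocked Gibbs sampling. The following sketch is one possible implementation of Eqs. (3)–(6), written under the assumption that the parameters are NumPy arrays and the sample space X is given as a list of values; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_v_given_h(h, b, w, sigma2):
    """Draw v ~ P(v | h, theta) of Eq. (3): independent Gaussians with means mu_i(h)."""
    mu = b + w @ h                          # Eq. (4)
    return rng.normal(mu, np.sqrt(sigma2))

def sample_h_given_v(v, c, w, sigma2, X):
    """Draw h ~ P(h | v, theta) of Eq. (5): independent discrete variables."""
    X = np.asarray(X, dtype=float)
    lam = c + (v / sigma2) @ w              # Eq. (6), shape (|H|,)
    logits = np.outer(lam, X)               # (|H|, |X|)
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    idx = [rng.choice(len(X), p=pj) for pj in p]
    return X[idx]
```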
The marginal distribution of the hidden variables is \begin{align} P(\boldsymbol{{h}}\mid \theta)&=\int P(\boldsymbol{{v}}, \boldsymbol{{h}}\mid \theta)\,d\boldsymbol{{v}}\notag\\ & =\frac{z_{H}(\theta)}{Z(\theta)}\exp\left(\sum_{j \in H}B_{j}h_{j} + \sum_{j \in H}D_{j} h_{j}^{2}+\sum_{j < k \in H}J_{jk}h_{j}h_{k}\right), \end{align} (7) where \begin{align*} B_{j} &:=c_{j} + \sum_{i \in V}\frac{b_{i}}{\sigma_{i}^{2}}w_{ij},\\ D_{j} &:= \frac{1}{2}\sum_{i \in V}\frac{w_{ij}^{2}}{\sigma_{i}^{2}}, \\ J_{jk} &:= \sum_{i \in V} \frac{w_{ij}w_{ik}}{\sigma_{i}^{2}}, \end{align*} and \begin{equation*} z_{H}(\theta):=\exp \left[\frac{1}{2}\sum_{i \in V}\ln (2\pi \sigma_{i}^{2})\right]. \end{equation*} The sum \(\sum_{j < k\in H}\) is the summation over all distinct pairs of hidden variables. The marginal distribution in Eq. (7) is the standard Boltzmann machine (or the multi-valued Ising model with anisotropic parameters) consisting of the hidden variables.
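The effective parameters of the marginal distribution in Eq. (7) are simple functions of θ and can be precomputed. A minimal sketch, assuming the same NumPy array layout as above:

```python
import numpy as np

def marginal_parameters(b, c, w, sigma2):
    """Effective parameters B_j, D_j, and J_jk of the marginal P(h | theta) in Eq. (7)."""
    B = c + (b / sigma2) @ w                            # B_j = c_j + sum_i b_i w_ij / sigma_i^2
    D = 0.5 * np.sum(w ** 2 / sigma2[:, None], axis=0)  # D_j = (1/2) sum_i w_ij^2 / sigma_i^2
    J = w.T @ (w / sigma2[:, None])                     # J_jk = sum_i w_ij w_ik / sigma_i^2
    np.fill_diagonal(J, 0.0)                            # only pairs j < k enter Eq. (7); the diagonal is carried by D
    return B, D, J
```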
Using Eqs. (3) and (7), the expectation of \(v_{i}\) is expressed as \begin{align} \langle v_{i}\rangle &:=\int \sum_{\boldsymbol{{h}}} v_{i} P(\boldsymbol{{v}}, \boldsymbol{{h}}\mid \theta)\,d\boldsymbol{{v}}\notag\\ &= \sum_{\boldsymbol{{h}}}\left(\int v_{i} P(\boldsymbol{{v}}\mid \boldsymbol{{h}},\theta)\,d\boldsymbol{{v}}\right)P(\boldsymbol{{h}}\mid \theta)\notag\\ &= b_{i} + \sum_{j \in H}w_{ij}\langle h_{j}\rangle, \end{align} (8) where \(\langle h_{j}\rangle\) is the expectation of \(h_{j}\). Therefore, it is found that the expectations of the visible variables are expressed in terms of the linear combination of the expectations of the hidden variables.
3. Mean-Field Approximations for GRBMs
In this section, for the GRBM defined in the previous section, we derive two different types of mean-field approximations: type I and type II mean-field approximations. The type I mean-field approximation is the NMFA for the whole set of variables in the GRBM, and the type II mean-field approximation is the NMFA for the marginal distribution in Eq. (7). The strategy of the type I method is analogous to that in Refs. 8 and 9, and the strategy of the type II method is analogous to that in Ref. 7.
3.1 Type I mean-field approximation for GRBMs
Let us prepare a test distribution in the form \begin{equation} T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}}):= \left(\prod_{i \in V}q_{i}(v_{i})\right)\left({}\prod_{j\in H}u_{j}(h_{j})\right) \end{equation} (9) and define the Kullback-Leibler divergence (KLD) between the GRBM in Eq. (2) and the test distribution as \begin{equation} \mathcal{K}_{1}[\{q_{i},u_{j}\}]:=\int \sum_{\boldsymbol{{h}}}T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}}) \ln \frac{T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}})}{P(\boldsymbol{{v}}, \boldsymbol{{h}}\mid \theta)}\,d\boldsymbol{{v}}. \end{equation} (10) The type I mean-field approximation is obtained by minimizing the KLD with respect to the test distribution. The KLD can be rewritten as \begin{equation} \mathcal{K}_{1}[\{q_{i}, u_{j}\}] = \mathcal{F}_{1}[\{q_{i}, u_{j}\}] + \ln Z(\theta), \end{equation} (11) where \begin{align*} \mathcal{F}_{1}[\{q_{i}, u_{j}\}]&:=\int \sum_{\boldsymbol{{h}}} E(\boldsymbol{{v}},\boldsymbol{{h}}; \theta)T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}\\ &\quad+\int \sum_{\boldsymbol{{h}}} T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}}) \ln T_{1}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}\end{align*} is the variational mean-field free energy of this approximation and it can be rewritten as \begin{align} \mathcal{F}_{1}[\{q_{i}, u_{j}\}]&=\sum_{i\in V}\int_{-\infty}^{\infty}\frac{(v_{i} - b_{i})^{2}}{2\sigma_{i}^{2}}q_{i}(v_{i})\,dv_{i}\notag\\ &\quad-\sum_{i\in V}\sum_{j\in H}\frac{w_{ij}}{\sigma_{i}^{2}}\int_{-\infty}^{\infty}v_{i} q_{i}(v_{i})\,dv_{i}\sum_{h_{j}\in \mathcal{X}}h_{j} u_{j}(h_{j})\notag\\ &\quad - \sum_{j\in H}c_{j}\sum_{h_{j}\in \mathcal{X}}h_{j} u_{j}(h_{j})\notag\\ &\quad+\sum_{i\in V}\int_{-\infty}^{\infty}q_{i}(v_{i})\ln q_{i}(v_{i})\,dv_{i} \notag\\ &\quad+ \sum_{j\in H}\sum_{h_{j} \in \mathcal{X}}u_{j}(h_{j})\ln u_{j}(h_{j}). \end{align} (12) Because \(\ln Z(\theta)\) is constant with respect to the test distribution, we minimize the variational free energy instead of the KLD. By variational minimization of the variational free energy with respect to \(q_{i}(v_{i})\) and \(u_{j}(h_{j})\) under the normalizing constraints \(\int_{-\infty}^{\infty} q_{i}(v_{i})\,dv_{i}=1\) and \(\sum_{h_{j}\in \mathcal{X}}u_{j}(h_{j}) = 1\), we obtain the resulting distributions as \begin{align} q_{i}^{*}(v_{i}) &= \mathcal{N}(v_{i} \mid \mu_{i}(\boldsymbol{{m}}),\sigma_{i}^{2}), \end{align} (13) \begin{align} u_{j}^{*}(h_{j}) &= \frac{\exp[\lambda_{j}({\boldsymbol{{\nu}}})h_{j}]}{\displaystyle\sum_{h \in \mathcal{X}} \exp[\lambda_{j}({\boldsymbol{{\nu}}})h]}, \end{align} (14) where \({\boldsymbol{{\nu}}} = \{\nu_{i}\mid i\in V\}\) and \(\boldsymbol{{m}}= \{m_{j}\mid j\in H\}\) are the expectations defined by \begin{align} \nu_{i} &:= \int_{-\infty}^{\infty}v_{i}q_{i}^{*}(v_{i})\,dv_{i}= \mu_{i}(\boldsymbol{{m}}), \end{align} (15) \begin{align} m_{j} &:= \sum_{h_{j} \in \mathcal{X}}h_{j} u_{j}^{*}(h_{j}). \end{align} (16) The functions \(\mu_{i}\) and \(\lambda_{j}\) are respectively defined in Eqs. (4) and (6). In the mean-field approximation, the distributions in Eqs. (13) and (14) are regarded as the mean-field approximation of the GRBM: \(P(\boldsymbol{{v}},\boldsymbol{{h}}\mid \theta)\approx T_{1}^{*}(\boldsymbol{{v}},\boldsymbol{{h}}) = (\prod_{i\in V}q_{i}^{*}(v_{i}))(\prod_{j\in H}u_{j}^{*}(h_{j}))\). Therefore, \({\boldsymbol{{\nu}}}\) and \(\boldsymbol{{m}}\), satisfying Eqs. 
(14)–(16), are the approximate expectations of the visible and the hidden variables, respectively, because \begin{align*} \langle v_{i}\rangle &=\int \sum_{\boldsymbol{{h}}}v_{i} P(\boldsymbol{{v}},\boldsymbol{{h}}\mid \theta)\,d\boldsymbol{{v}}\approx \int \sum_{\boldsymbol{{h}}} v_{i} T_{1}^{*}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}= \nu_{i},\\ \langle h_{j}\rangle &=\int \sum_{\boldsymbol{{h}}}h_{j} P(\boldsymbol{{v}},\boldsymbol{{h}}\mid \theta)\,d\boldsymbol{{v}}\approx \int \sum_{\boldsymbol{{h}}} h_{j}T_{1}^{*}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}= m_{j}. \end{align*} By numerically solving the mean-field equations in Eqs. (14)–(16), using, for example, the method of successive substitution, we can obtain the values of \({\boldsymbol{{\nu}}}\) and \(\boldsymbol{{m}}\).
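As a concrete example of such a successive-substitution scheme, the type I mean-field equations (14)–(16) could be iterated as in the sketch below; the initialization, iteration count, and convergence tolerance are illustrative choices and are not prescribed by the derivation above.

```python
import numpy as np

def type1_mean_field(b, c, w, sigma2, X, n_iter=1000, tol=1e-10):
    """Solve the type I mean-field equations (14)-(16) by successive substitution.

    Returns the approximate expectations nu (visible) and m (hidden)."""
    X = np.asarray(X, dtype=float)
    m = np.zeros(w.shape[1])                      # initial guess for the hidden expectations
    for _ in range(n_iter):
        nu = b + w @ m                            # Eq. (15): nu_i = mu_i(m)
        lam = c + (nu / sigma2) @ w               # lambda_j(nu), Eq. (6)
        logits = np.outer(lam, X)                 # Eq. (14)
        u = np.exp(logits - logits.max(axis=1, keepdims=True))
        u /= u.sum(axis=1, keepdims=True)
        m_new = u @ X                             # Eq. (16): m_j = sum_h h u_j*(h)
        if np.max(np.abs(m_new - m)) < tol:
            m = m_new
            break
        m = m_new
    return b + w @ m, m
```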
3.2 Type II mean-field approximation for GRBMs
In the type II mean-field approximation, we use a test distribution in the form of \begin{equation} T_{2}(\boldsymbol{{v}}, \boldsymbol{{h}}):= P(\boldsymbol{{v}}\mid \boldsymbol{{h}},\theta)\prod_{j\in H}u_{j}(h_{j}), \end{equation} (17) where \(P(\boldsymbol{{v}}\mid\boldsymbol{{h}},\theta)\) is the conditional distribution in Eq. (3). The KLD between the test distribution and the GRBM is expressed as \begin{align} \mathcal{K}_{2}[\{u_{j}\}]&:=\int \sum_{\boldsymbol{{h}}}T_{2}(\boldsymbol{{v}}, \boldsymbol{{h}}) \ln \frac{T_{2}(\boldsymbol{{v}}, \boldsymbol{{h}})}{P(\boldsymbol{{v}}, \boldsymbol{{h}}\mid \theta)}\,d\boldsymbol{{v}}\notag\\ &=\sum_{\boldsymbol{{h}}}\left({}\prod_{j\in H}u_{j}(h_{j})\right) \ln \frac{\displaystyle\prod_{j\in H}u_{j}(h_{j})}{P(\boldsymbol{{h}}\mid \theta)}, \end{align} (18) where \(P(\boldsymbol{{h}}\mid \theta)\) is the marginal distribution in Eq. (7). The KLD can be rewritten as \begin{equation} \mathcal{K}_{2}[\{u_{j}\}] = \mathcal{F}_{2}[\{u_{j}\}] + \ln Z(\theta), \end{equation} (19) where \begin{align*} \mathcal{F}_{2}[\{u_{j}\}]&:= \sum_{\boldsymbol{{h}}}E_{H}(\boldsymbol{{h}}; \theta)\prod_{j\in H}u_{j}(h_{j}) \\ &\quad+\sum_{\boldsymbol{{h}}}\left(\prod_{j\in H}u_{j}(h_{j})\right) \ln \prod_{j\in H}u_{j}(h_{j}) - \ln z_{H}(\theta) \end{align*} is the variational mean-field free energy of this approximation and \begin{equation*} E_{H}(\boldsymbol{{h}}; \theta):=-\sum_{j \in H}B_{j} h_{j} -\sum_{j \in H}D_{j} h_{j}^{2}-\sum_{j < k \in H}J_{jk}h_{j}h_{k} \end{equation*} is the energy function of the marginal distribution in Eq. (7). This variational mean-field free energy can be rewritten as \begin{align} \mathcal{F}_{2}[\{u_{j}\}]&= -{\sum_{j \in H}B_{j}} \sum_{h_{j} \in \mathcal{X}}h_{j}u_{j}(h_{j}) -\sum_{j \in H}D_{j} \sum_{h_{j} \in \mathcal{X}} h_{j}^{2}u_{j}(h_{j})\notag\\ &\quad - \sum_{j < k \in H}J_{jk}\sum_{h_{j} \in \mathcal{X}}h_{j}u_{j}(h_{j})\sum_{h_{k}\in \mathcal{X}} h_{k}u_{k}(h_{k})\notag\\ &\quad+\sum_{j \in H}\sum_{h_{j} \in \mathcal{X}}u_{j}(h_{j})\ln u_{j}(h_{j}) - \ln z_{H}(\theta). \end{align} (20)
By variational minimization of the variational free energy in Eq. (20) under the normalizing constraints \(\sum_{h_{j}\in \mathcal{X}}u_{j}(h_{j}) = 1\), we obtain \begin{equation} u_{j}^{\dagger}(h_{j}) = \frac{\exp\biggl(B_{j} h_{j} + D_{j} h_{j}^{2} + \displaystyle\sum_{k \in H\setminus\{j\}}J_{jk}m_{k}^{\dagger} h_{j}\biggr)}{\displaystyle\sum_{h \in \mathcal{X}} \exp \biggl(B_{j} h + D_{j} h^{2} + \sum_{k \in H \setminus\{j\}}J_{jk}m_{k}^{\dagger}h\biggr)}, \end{equation} (21) where \(\boldsymbol{{m}}^{\dagger} = \{m_{j}^{\dagger}\mid j\in H\}\) are the expectations defined by \begin{equation} m_{j}^{\dagger} := \sum_{h_{j} \in \mathcal{X}}h_{j} u_{j}^{\dagger}(h_{j}). \end{equation} (22) The resulting test distribution, \(T_{2}^{\dagger}(\boldsymbol{{v}},\boldsymbol{{h}}) = P(\boldsymbol{{v}}\mid\boldsymbol{{h}},\theta)\prod_{j\in H}u_{j}^{\dagger}(h_{j})\), is regarded as the mean-field approximation of the GRBM in this approximation. Therefore, \(\boldsymbol{{m}}^{\dagger}\), which satisfy Eqs. (21) and (22), are the approximate expectations of the hidden variables, because \begin{equation*} \langle h_{j}\rangle =\int \sum_{\boldsymbol{{h}}} h_{j} P(\boldsymbol{{v}}, \boldsymbol{{h}}\mid \theta)\,d\boldsymbol{{v}}\approx \int \sum_{\boldsymbol{{h}}}h_{j} T_{2}^{\dagger}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}= m_{j}^{\dagger}. \end{equation*} By solving the mean-field equations in Eqs. (21) and (22), we can obtain the values of \(\boldsymbol{{m}}^{\dagger}\). The approximate expectations of the visible variables, \({\boldsymbol{{\nu}}}^{\dagger} = \{\nu_{i}^{\dagger}\mid i\in V\}\), can be obtained in terms of \(\boldsymbol{{m}}^{\dagger}\) as \begin{equation} \langle v_{i}\rangle\approx \nu_{i}^{\dagger}:= \int \sum_{\boldsymbol{{h}}}v_{i}T_{2}^{\dagger}(\boldsymbol{{v}}, \boldsymbol{{h}})\,d\boldsymbol{{v}}= \mu_{i}(\boldsymbol{{m}}^{\dagger}). \end{equation} (23) Using Eqs. (21)–(23), we can rewrite the above mean-field equations as \begin{align} \nu_{i}^{\dagger} &= \mu_{i}(\boldsymbol{{m}}^{\dagger}), \end{align} (24) \begin{align} m_{j}^{\dagger} &= \sum_{h_{j} \in \mathcal{X}}h_{j} y_{j}(h_{j}), \end{align} (25) where \begin{equation*} y_{j}(h_{j}):=\frac{\exp\biggl[\lambda_{j}({\boldsymbol{{\nu}}}^{\dagger}) h_{j} - \displaystyle\sum_{i \in V}(w_{ij}/\sigma_{i})^{2}(m_{j}^{\dagger} - h_{j}/2)h_{j}\biggr]}{\displaystyle\sum_{h \in \mathcal{X}} \exp\biggl[\lambda_{j}({\boldsymbol{{\nu}}}^{\dagger})h - \sum_{i \in V}(w_{ij}/\sigma_{i})^{2}(m_{j}^{\dagger} - h/2) h\biggr]}. \end{equation*} The values of \({\boldsymbol{{\nu}}}^{\dagger}\) and \(\boldsymbol{{m}}^{\dagger}\) can also be obtained by numerically solving the mean-field equations in Eqs. (24) and (25) instead of solving those in Eqs. (21)–(23). The order of the computational cost of solving the mean-field equations is the same as that of the type I mean-field approximation presented in Sect. 3.1.
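For comparison with the type I scheme, a corresponding sketch of a successive-substitution loop for the type II equations (21) and (22), written in terms of the effective parameters B_j, D_j, and J_jk of Eq. (7); as before, the initialization and stopping rule are our own illustrative choices.

```python
import numpy as np

def type2_mean_field(b, c, w, sigma2, X, n_iter=1000, tol=1e-10):
    """Solve the type II mean-field equations (21) and (22) by successive substitution."""
    X = np.asarray(X, dtype=float)
    B = c + (b / sigma2) @ w
    D = 0.5 * np.sum(w ** 2 / sigma2[:, None], axis=0)
    J = w.T @ (w / sigma2[:, None])
    np.fill_diagonal(J, 0.0)                      # the diagonal contribution is carried by D
    m = np.zeros(w.shape[1])
    for _ in range(n_iter):
        field = B + J @ m                         # B_j + sum_{k != j} J_jk m_k, as in Eq. (21)
        logits = np.outer(field, X) + np.outer(D, X ** 2)
        u = np.exp(logits - logits.max(axis=1, keepdims=True))
        u /= u.sum(axis=1, keepdims=True)
        m_new = u @ X                             # Eq. (22)
        if np.max(np.abs(m_new - m)) < tol:
            m = m_new
            break
        m = m_new
    return b + w @ m, m                           # nu^dagger from Eq. (23), and m^dagger
```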
4. Comparison of Two Mean-Field Methods
In Sects. 3.1 and 3.2, we derived two different mean-field approximations for the GRBM: the type I and the type II mean-field approximations. Both approximations are constructed on the basis of the NMFA. Now, we are interested in which approximation is better. Intuitively, the type II mean-field approximation seems to be better, because the number of variables on which the mean-field assumption, namely, the factorization assumption of the distribution, is imposed is smaller in the type II mean-field approximation than in the type I mean-field approximation. In this section, we qualitatively and quantitatively compare the two approximations and show that this intuitive prediction is valid.
4.1 Qualitative comparison
Before discussing the qualitative relationship between the mean-field approximations, we provide a general theorem that will be an important basis for our final results in this section. Let us consider continuous or discrete random variables \(\boldsymbol{{x}}= \{x_{i}\mid i = 1,2,\ldots,n\}\) and divide the variables into two different sets: \(\boldsymbol{{x}}=\boldsymbol{{x}}_{A}\cup \boldsymbol{{x}}_{B}\). We define a distribution \(P(\boldsymbol{{x}})\) over the random variables, and define two kinds of test distributions for it as \(T_{\text{all}}(\boldsymbol{{x}}):=Q(\boldsymbol{{x}}_{A})U(\boldsymbol{{x}}_{B})\) and \(T_{\text{part}}(\boldsymbol{{x}}):=P(\boldsymbol{{x}}_{A}\mid \boldsymbol{{x}}_{B})U(\boldsymbol{{x}}_{B})\). Here, \(P(\boldsymbol{{x}}_{A}\mid \boldsymbol{{x}}_{B})\) is the conditional distribution of \(P(\boldsymbol{{x}})\), and \(Q(\boldsymbol{{x}}_{A})\) and \(U(\boldsymbol{{x}}_{B})\) are distributions over \(\boldsymbol{{x}}_{A}\) and \(\boldsymbol{{x}}_{B}\), respectively. For the test distributions \(T_{\text{all}}(\boldsymbol{{x}})\) and \(T_{\text{part}}(\boldsymbol{{x}})\), we define the KLDs as \begin{equation} \mathcal{K}_{\text{all}}[Q,U] :=\sum_{\boldsymbol{{x}}}T_{\text{all}}(\boldsymbol{{x}}) \ln \frac{T_{\text{all}}(\boldsymbol{{x}})}{P(\boldsymbol{{x}})} \end{equation} (26) and \begin{equation} \mathcal{K}_{\text{part}}[U] :=\sum_{\boldsymbol{{x}}}T_{\text{part}}(\boldsymbol{{x}}) \ln \frac{T_{\text{part}}(\boldsymbol{{x}})}{P(\boldsymbol{{x}})}, \end{equation} (27) respectively, where the sum \(\sum_{\boldsymbol{{x}}} =\sum_{x_{1}}\sum_{x_{2}}\cdots\sum_{x_{n}}\) is the multiple summation over all the possible realizations of \(\boldsymbol{{x}}\). If some variables are continuous, the corresponding summations become integrations. Under this setting, we obtain the following proposition.
Proposition 1.
For any distribution \(P(\boldsymbol{{x}})\) over \(n\) random variables \(\boldsymbol{{x}}= \{x_{i}\mid i = 1,2,\ldots,n\}\) with any sample spaces, the inequality \begin{equation*} \min_{Q,U}\mathcal{K}_{\text{all}}[Q,U]\geq \min_{U} \mathcal{K}_{\text{part}}[U] \end{equation*} is ensured, where \(\mathcal{K}_{\text{all}}[Q,U]\) and \(\mathcal{K}_{\text{part}}[U]\) are the KLDs defined in Eqs. (26) and (27), respectively.
Proof.
From Eqs. (26) and (27), \(\mathcal{K}_{\text{all}}[Q,U]\) can be rewritten as \begin{equation*} \mathcal{K}_{\text{all}}[Q,U] = \sum_{\boldsymbol{{x}}}Q(\boldsymbol{{x}}_{A})U(\boldsymbol{{x}}_{B}) \ln \frac{Q(\boldsymbol{{x}}_{A})}{P(\boldsymbol{{x}}_{A} \mid \boldsymbol{{x}}_{B})} + \mathcal{K}_{\text{part}}[U]. \end{equation*} By using this expression, the inequality \begin{align} &\min_{Q,U}\mathcal{K}_{\text{all}}[Q,U]\notag\\ &\quad=\sum_{\boldsymbol{{x}}}Q^{*}(\boldsymbol{{x}}_{A})U^{*}(\boldsymbol{{x}}_{B}) \ln \frac{Q^{*}(\boldsymbol{{x}}_{A})}{P(\boldsymbol{{x}}_{A} \mid \boldsymbol{{x}}_{B})} + \mathcal{K}_{\text{part}}[U^{*}]\notag\\ &\quad\geq \sum_{\boldsymbol{{x}}}Q^{*}(\boldsymbol{{x}}_{A})U^{*}(\boldsymbol{{x}}_{B}) \ln \frac{Q^{*}(\boldsymbol{{x}}_{A})}{P(\boldsymbol{{x}}_{A} \mid \boldsymbol{{x}}_{B})} + \min_{U} \mathcal{K}_{\text{part}}[U] \end{align} (28) is obtained, where \(Q^{*}(\boldsymbol{{x}}_{A})\) and \(U^{*}(\boldsymbol{{x}}_{B})\) are the distributions that minimize \(\mathcal{K}_{\text{all}}[Q,U]\). By using the inequality \(\ln X\leq X - 1\) for \(X\geq 0\), we obtain \begin{align} &-{\sum_{\boldsymbol{{x}}}}Q^{*}(\boldsymbol{{x}}_{A})U^{*}(\boldsymbol{{x}}_{B}) \ln \frac{P(\boldsymbol{{x}}_{A} \mid \boldsymbol{{x}}_{B})}{Q^{*}(\boldsymbol{{x}}_{A})}\notag\\ &\quad\geq \sum_{\boldsymbol{{x}}}Q^{*}(\boldsymbol{{x}}_{A})U^{*}(\boldsymbol{{x}}_{B})\left(1 - \frac{P(\boldsymbol{{x}}_{A} \mid \boldsymbol{{x}}_{B})}{Q^{*}(\boldsymbol{{x}}_{A})}\right) = 0. \end{align} (29) From Eqs. (28) and (29), the proposition is obtained.
In Proposition 1, by regarding \(\boldsymbol{{x}}_{A}\) and \(\boldsymbol{{x}}_{B}\) as \(\boldsymbol{{v}}\) and \(\boldsymbol{{h}}\), respectively, and by regarding \(P(\boldsymbol{{x}})\) as the GRBM, we immediately obtain the following corollary.
Corollary 1.
For the GRBM in Eq. (2), the inequality \begin{equation*} \min_{\{q_{i}, u_{j}\}}\mathcal{K}_{1}[\{q_{i}, u_{j}\}] \geq \min_{\{u_{j}\}} \mathcal{K}_{2}[\{u_{j}\}] \end{equation*} is ensured, where \(\mathcal{K}_{1}[\{q_{i}, u_{j}\}]\) and \(\mathcal{K}_{2}[\{u_{j}\}]\) are the KLDs defined in Eqs. (10) and (18).
A KLD is regarded as a measure of the distance between two different distributions. Corollary 1 suggests that the mean-field distribution obtained by the type II mean-field approximation is closer to the GRBM than that obtained by the type I mean-field approximation from the viewpoint of the KLD.
We can obtain the following proposition for free energies.
Proposition 2.
For the GRBM in Eq. (2), the inequality \begin{equation*} F_{1}(\theta) \geq F_{2}(\theta) \geq F(\theta) \end{equation*} is ensured, where \(F_{1}(\theta)\) and \(F_{2}(\theta)\), defined by \(F_{1}(\theta):=\min_{\{q_{i}, u_{j}\}}\mathcal{F}_{1}[\{q_{i}, u_{j}\}]\) and \(F_{2}(\theta):=\min_{\{u_{j}\}}\mathcal{F}_{2}[\{u_{j}\}]\), are the mean-field free energies obtained by the type I and the type II mean-field approximations, respectively, and where \(F(\theta):= -\ln Z(\theta)\) is the true free energy of the GRBM.
Proof.
Since a KLD is nonnegative, from Eqs. (11) and (19), we obtain \begin{equation} F_{1}(\theta) \geq F(\theta), \quad F_{2}(\theta) \geq F(\theta). \end{equation} (30) From Corollary 1 and Eqs. (11) and (19), we have \begin{equation} F_{1}(\theta) \geq F_{2}(\theta). \end{equation} (31) From Eqs. (30) and (31), we obtain the proposition.
From this proposition, it is guaranteed that the mean-field free energy obtained by the type II mean-field approximation is closer to the true free energy than that obtained by the type I mean-field approximation.
4.2 Quantitative comparison
In this section, we quantitatively compare the two mean-field approximations through numerical experiments. In the numerical experiments, we use a GRBM with 24 visible variables and 12 hidden variables. Because this GRBM is small, we can evaluate the exact values of its free energy and expectations. In the following experiments, we generate the values of the biases, \(b_{i}\) and \(c_{j}\), and of the couplings \(w_{ij}\) from Gaussian distributions, and we fix all \(\sigma_{i}^{2}\) to one.
Figures 2 and 3 show the dependencies of the three free energies, the type I mean-field free energy \(F_{1}(\theta)\), the type II mean-field free energy \(F_{2}(\theta)\), and the true free energy \(F(\theta)\), on the parameters when \(\mathcal{X} = \{-1,+1\}\) and \(\mathcal{X} = \{-1,0,+1\}\), respectively. Since the partition function of the marginal distribution in Eq. (7) is \(Z(\theta)/z_{H}(\theta)\), the true free energy can be evaluated by performing the following multiple summation. \begin{align*} F(\theta)&= - \frac{1}{2}\sum_{i \in V}\ln (2\pi \sigma_{i}^{2}) \\ &\quad- \ln \sum_{\boldsymbol{{h}}}\exp\left(\sum_{j \in H}B_{j} h_{j} + \sum_{j \in H}D_{j}h_{j}^{2} + \sum_{j < k \in H}J_{jk}h_{j}h_{k}\right). \end{align*} The mean-field free energies, \(F_{1}(\theta)\) and \(F_{2}(\theta)\), are obtained by substituting the solutions to the mean-field equations of the type I and type II methods, i.e., \(\{\nu_{i}, m_{j}\}\) and \(\{\nu_{i}^{\dagger}, m_{j}^{\dagger}\}\), into Eqs. (12) and (20), respectively. Each plot in Figs. 2 and 3 is the average over 10000 trials, and the parameters, \(\boldsymbol{{b}}\), \(\boldsymbol{{c}}\), and \(\boldsymbol{{w}}\), used in the experiments were generated as follows. For Figs. 2(a) and 3(a), they were independently drawn from \(\mathcal{N}(b_{i}\mid 0, 0.1^{2})\), \(\mathcal{N}(c_{j}\mid 0, 0.1^{2})\), and \(\mathcal{N}(w_{ij}\mid 0,\text{SD}^{2})\), respectively. For Figs. 2(b) and 3(b), they were independently drawn from \(\mathcal{N}(b_{i}\mid 0,\text{SD}^{2})\), \(\mathcal{N}(c_{j}\mid 0, 0.1^{2})\), and \(\mathcal{N}(w_{ij}\mid 0, 0.1^{2})\), respectively. For Figs. 2(c) and 3(c), they were independently drawn from \(\mathcal{N}(b_{i}\mid 0, 0.1^{2})\), \(\mathcal{N}(c_{j}\mid 0,\text{SD}^{2})\), and \(\mathcal{N}(w_{ij}\mid 0, 0.1^{2})\), respectively. One can observe that the results shown in Figs. 2 and 3 are consistent with the theoretical result presented in Proposition 2.
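For a model of this size, the multiple summation above can be carried out by brute force. The following sketch, which reuses the effective parameters of Eq. (7) and is intended only for small numbers of hidden variables, shows one way to evaluate the true free energy; the function name and the use of itertools are our own illustrative choices.

```python
import itertools
import numpy as np

def exact_free_energy(b, c, w, sigma2, X):
    """Exact F(theta) by enumerating all hidden configurations (feasible only for small |H|)."""
    B = c + (b / sigma2) @ w
    D = 0.5 * np.sum(w ** 2 / sigma2[:, None], axis=0)
    J = w.T @ (w / sigma2[:, None])
    np.fill_diagonal(J, 0.0)
    log_terms = []
    for h in itertools.product(X, repeat=w.shape[1]):
        h = np.asarray(h, dtype=float)
        # 0.5 * h^T J h equals sum_{j<k} J_jk h_j h_k because J is symmetric with zero diagonal
        log_terms.append(B @ h + D @ (h ** 2) + 0.5 * h @ J @ h)
    log_sum = np.logaddexp.reduce(log_terms)
    return -0.5 * np.sum(np.log(2.0 * np.pi * sigma2)) - log_sum
```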
Figure 2. Dependency of the free energies on the standard deviation of (a) the couplings \(\boldsymbol{{w}}\), (b) the biases \(\boldsymbol{{b}}\), and (c) the biases \(\boldsymbol{{c}}\), when \(\mathcal{X} = \{-1,+1\}\).
Figure 3. Dependency of the free energies on the standard deviation of (a) the couplings \(\boldsymbol{{w}}\), (b) the biases \(\boldsymbol{{b}}\), and (c) the biases \(\boldsymbol{{c}}\), when \(\mathcal{X} = \{-1,0,+1\}\).
Figures 4 and 5 show the dependencies of the mean square errors (MSEs) between the exact expectations and the mean-field solutions on the parameters when \(\mathcal{X} = \{-1,+1\}\) and \(\mathcal{X} = \{-1,0,+1\}\), respectively. The plots “type I (h)” and “type I (v)” are the MSEs between \(\langle h_{j}\rangle\) and \(m_{j}\) and between \(\langle v_{i}\rangle\) and \(\nu_{i}\), respectively, that is, \(|H|^{-1}\sum_{j\in H}(\langle h_{j}\rangle - m_{j})^{2}\) and \(|V|^{-1}\sum_{i\in V}(\langle v_{i}\rangle -\nu_{i})^{2}\), respectively. The plots “type II (h)” and “type II (v)” are the MSEs between \(\langle h_{j}\rangle\) and \(m_{j}^{\dagger}\) and between \(\langle v_{i}\rangle\) and \(\nu_{i}^{\dagger}\), respectively, that is, \(|H|^{-1}\sum_{j\in H}(\langle h_{j}\rangle - m_{j}^{\dagger})^{2}\) and \(|V|^{-1}\sum_{i\in V}(\langle v_{i}\rangle -\nu_{i}^{\dagger})^{2}\), respectively. Each plot in Figs. 4 and 5 is the average over 10000 trials, and the parameters, \(\boldsymbol{{b}}\), \(\boldsymbol{{c}}\), and \(\boldsymbol{{w}}\), used in the experiments were generated in the same manner as that for Figs. 2 and 3. We can observe that the type II method gives better approximations than the type I method.
Figure 4. Dependency of the MSEs of the expectations on the standard deviation of (a) the couplings \(\boldsymbol{{w}}\), (b) the biases \(\boldsymbol{{b}}\), and (c) the biases \(\boldsymbol{{c}}\), when \(\mathcal{X} = \{-1,+1\}\).
Figure 5. Dependency of the MSEs of the expectations on the standard deviation of (a) the couplings \(\boldsymbol{{w}}\), (b) the biases \(\boldsymbol{{b}}\), and (c) the biases \(\boldsymbol{{c}}\), when \(\mathcal{X} = \{-1,0,+1\}\).
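The exact expectations used in the MSEs of Figs. 4 and 5 can likewise be obtained by enumerating the marginal distribution in Eq. (7), with the visible expectations then following from Eq. (8). A minimal sketch under the same assumptions as the previous code examples:

```python
import itertools
import numpy as np

def exact_expectations(b, c, w, sigma2, X):
    """Exact <h_j> and <v_i> by enumerating P(h | theta) in Eq. (7) (small |H| only)."""
    B = c + (b / sigma2) @ w
    D = 0.5 * np.sum(w ** 2 / sigma2[:, None], axis=0)
    J = w.T @ (w / sigma2[:, None])
    np.fill_diagonal(J, 0.0)
    configs = np.array(list(itertools.product(X, repeat=w.shape[1])), dtype=float)
    log_p = configs @ B + (configs ** 2) @ D \
        + 0.5 * np.einsum('nj,jk,nk->n', configs, J, configs)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    exact_h = p @ configs                    # <h_j>
    exact_v = b + w @ exact_h                # <v_i> from Eq. (8)
    return exact_v, exact_h

# e.g., the MSE of the hidden expectations for a mean-field solution m:
#   mse_h = np.mean((exact_h - m) ** 2)
```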
5. Conclusion
In this paper, we derived two different types of NMFAs, the type I and the type II methods, for GRBMs and compared them analytically and numerically. Further, we presented propositions and a corollary that guarantee that the type II method provides (1) a lower value of the KLD than the type I method and (2) a mean-field free energy that is closer to the true free energy than that provided by the type I method. Moreover, in our numerical experiments, we observed that the expectations obtained by the type II method are more accurate than those obtained by the type I method. Since the orders of the computational costs of the two methods are the same, we conclude that the type II method is better than the type I method; that is, we should apply the NMFA to a marginalized system rather than to the whole system.
Although the statements presented in this paper were made only for the NMFA, we expect that the insights obtained here can be extended to advanced mean-field methods. From this perspective, the results obtained in this paper implicitly support the validity of the method presented in Ref. 7. We are now interested in the application of more sophisticated mean-field methods, such as the adaptive TAP method16) and susceptibility propagation,17) to GRBMs. In particular, we believe that the application of the adaptive TAP method is important because, as mentioned in Ref. 15, GRBMs are strongly related to Hopfield-type systems, and the adaptive TAP method can be justified for such systems. This will be addressed in our future studies.
The free energies of GRBMs and of some of their variants, such as RBMs with discrete visible and hidden variables and RBMs with continuous visible and hidden variables, e.g., Gaussian–Gaussian RBMs,18) can be evaluated via statistical mechanical methods, such as the replica method, in specific cases. Developments of these free energy evaluations are expected to enable us to evaluate the typical performance of inference algorithms for RBMs in the thermodynamic limit. This is also an interesting direction for our future studies.
Acknowledgments
This work was partially supported by CREST, Japan Science and Technology Agency and by JSPS KAKENHI Grant Numbers 15K00330, 25280089, and 15H03699.
References
1 P. Smolensky, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, ed. D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (MIT Press, Cambridge, MA, 1986) Vol. 1, p. 194.
2 G. E. Hinton, Neural Comput. 14, 1771 (2002). 10.1162/089976602760128018
3 G. E. Hinton, S. Osindero, and Y. W. Teh, Neural Comput. 18, 1527 (2006). 10.1162/neco.2006.18.7.1527
4 G. E. Hinton and R. Salakhutdinov, Science 313, 504 (2006). 10.1126/science.1127647
5 R. Salakhutdinov, A. Mnih, and G. E. Hinton, Proc. 24th Int. Conf. Machine Learning (ICML2007), 2007, p. 791.
6 T. Tran, D. Phung, and S. Venkatesh, Proc. 30th Int. Conf. Machine Learning (ICML2013), 2013, Vol. 28, p. 40.
7 H. Huang and T. Toyoizumi, Phys. Rev. E 91, 050101 (2015). 10.1103/PhysRevE.91.050101