Bitcoin's Crypto Flow Network

How crypto flows among Bitcoin users is an important question for understanding the structure and dynamics of the cryptoasset at a global scale. We compiled all the blockchain data of Bitcoin from its genesis to the year 2020, identified users from the anonymous addresses of wallets, and constructed monthly snapshots of networks, focusing on regular users as big players. We apply the methods of bow-tie structure and Hodge decomposition in order to locate the users in the upstream, downstream, and core of the entire crypto flow. Additionally, we reveal principal components hidden in the flow by using non-negative matrix factorization, which we interpret as a probabilistic model. We show that the model is equivalent to a probabilistic latent semantic analysis in natural language processing, enabling us to estimate the number of such hidden components. Moreover, we find that the bow-tie structure and the principal components are quite stable among those big players. This study can serve as a solid basis for further investigation of the temporal change of crypto flow, the entry and exit of big players, and so forth.


Introduction
A cryptoasset or cryptocurrency is essentially a digital ledger to record transactions between creditors and debtors, just like money. The digital system is based on a collection of decentralized ledgers, called a blockchain, which contains the entire historical record of transactions among anonymous users. Today there are many cryptoassets exchanged in markets against fiat currencies and also against each other. The total market capitalization is huge, ranging from one to a few trillion USD, and highly volatile, potentially having a big impact even on non-crypto asset markets and prices at a global scale.
In this paper, we study Bitcoin, the largest cryptoasset, dominating nearly half of the market capitalization at the time of writing. We attempt to understand the flow of crypto as a complex network comprising the users as nodes and the crypto flow as links. There are a number of studies of cryptoassets from such a complex-network viewpoint; see [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16], for example, and references therein.
Specifically, in this paper, we focus on "big players", defined as persistently appearing users who are likely involved in transactions of high frequency and large amount, and address the following questions. First, it is important to identify the users in the upstream, downstream, or core of the entire crypto flow. We shall examine the so-called "bow-tie" structure of the network of those big players to classify their locations in the crypto flow based on the binary relationship of links. Second, we measure the location in a more quantitative way by using the information of flow along the links, through the combinatorial method of Hodge decomposition. The third question is how to extract "principal components" hidden in the entire crypto flow, so as to uncover a certain number of latent factors or components.
Fourth, because the network is changing in time, what can one say about the stability of the crypto flow?
In Section 2, we describe our dataset of Bitcoin and how we identify users from the anonymous addresses representing wallets. We then define regular users as big players and construct networks. In Section 3.1, we perform the bow-tie analysis to locate users in the stream of crypto flow. In Section 3.2, we use the method of Hodge decomposition to quantify the location of users. In Section 3.3, we introduce non-negative matrix factorization as a method of matrix decomposition to reveal principal components hidden in the flow. We shall show in Section 3.4 that the method can be interpreted as a probabilistic model, from which one can estimate the number of components. In Section 3.5, we find about a dozen principal components among several hundred big players, and show that the temporal change of the network is quite stable. In Section 4, we discuss several aspects worth further investigation, and we conclude in Section 5. We add Appendix A for the identities (actual names, business, and so forth) of selected users. Appendix B illustrates the networks as adjacency matrices. Appendix C briefly summarizes the above-mentioned probabilistic model in relation to latent Dirichlet allocation, from which the estimation of the number of components is done in Appendix D.

Data
We employ the dataset of all transactions recorded in the Bitcoin blockchain from the genesis block (first block, issued on January 9, 2009) until the block of height 63,299 (inclusive; issued on June 4, 2020). Each transaction is a transfer of a certain amount of BTC (the monetary unit of Bitcoin) from one or more addresses to others, as we will see shortly. We call such a transfer of BTC a crypto flow. An address is something like a wallet possessed by a user, who can be an individual or, more frequently today, an agent in the business of exchange, services, gambling, and so forth.
In the dataset, the total number of transactions was 1.38 billion, while the number of distinct addresses was about 657 million. To study the crypto flow of Bitcoin, one needs to know users rather than addresses. However, it is not straightforward to identify users from addresses because of the very anonymity inherent in the core technology of blockchain. See [17] for technical details.
Let us employ a simple but useful method to identify users from addresses and to construct a giant graph comprising nodes as users and edges as crypto flow. We shall see that more than 60% of the addresses can be identified with users. Additionally, we will see that a number of users can be revealed with their actual names, types of business, and sometimes geographical location at a global scale. We then define regular users as big players in order to focus on a subgraph comprising frequently appearing users who are involved in crypto flow with huge amounts of BTC. This subgraph will be studied in the subsequent sections.

Identification of Users from Addresses
Consider an example of a transaction (TX) in which, one day, Alice transferred 1 BTC to Bob:

  TX1: inputs (a_1, a_2) → outputs (a_123, a_1),   (1)

where the addresses a_1 and a_2 belong to Alice, while a_123 belongs to Bob. Alice needs more than one address as the input of TX1, because a single one was not sufficient to cover the amount of 1 BTC. The output of TX1 includes a_1, representing the change. Another day, Alice made another transaction:

  TX2: inputs (a_1, a_3) → outputs (. . .),   (2)

where the address a_3 also belongs to Alice. Multiple addresses that appear together in the input of a transaction belong to the same user, namely her wallets. As a consequence of both (1) and (2), it follows that a_1, a_2, a_3 can be identified as belonging to the same user.

[...] dominant, followed by Canada, Australia, Brazil, Singapore, and Russia. In fact, as found at the top of Table A·3, the user ID 0000000000, corresponding to the maximum size in Fig. 1, is actually Bit-x.com and Xapo.com, the former of which is an exchange agent in South Africa. Exchanges are a typical category of "big players" in the sense that they hold a huge number of individuals and agents as customers, resulting in a large number of transfers. As a matter of fact, the daily number of transfers shows an interesting weekly pattern: there is significantly less activity on weekends than on weekdays in recent Bitcoin data (see our previous studies [21, 22]). Such a weekly pattern implies that those institutional agents are dominant in the entire flow of crypto.
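The common-input heuristic described above is effectively a union-find over addresses that co-occur as inputs of the same transaction. A minimal sketch (the address labels mirror the toy example above, not the actual dataset):

```python
class UnionFind:
    """Union-find over address strings for the common-input heuristic."""
    def __init__(self):
        self.parent = {}

    def find(self, a):
        self.parent.setdefault(a, a)
        while self.parent[a] != a:
            # path halving keeps trees shallow
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
# all input addresses of a transaction belong to one user
for tx_inputs in [["a1", "a2"], ["a1", "a3"]]:
    first = tx_inputs[0]
    for addr in tx_inputs[1:]:
        uf.union(first, addr)

# a1, a2, a3 now share one representative user; a123 (Bob) does not
```

Running this over all transactions yields the user partition; each connected component of co-occurring input addresses is one user.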

Crypto Flow Network
Now all the transactions among addresses are converted into transfers from user to user, each of which has the following information:
• user of the source s,
• user of the destination d,
• amount of BTC transferred from s to d, i.e., s → d,
• UTC time of the transfer (from the block containing the transaction).
During a certain period of time T, there can be more than one transfer for a pair of users, as depicted in Fig. 3 (see the left-hand side). In this example, there are three transfers of crypto for i → j, one for j → i, and two for i → i. The last case, a self-loop, is possible because one can receive change in a transaction, and also because different addresses may be identified with a single user, such as an exchange. Given a time-scale T, it is reasonable to aggregate these transactions as shown in the right-hand side of Fig. 3.
After the aggregation, one has a network comprising nodes as users and directed edges given by the transfer of crypto, with the frequency and amount of transfers that occurred during the period. Let us define the following variables, which represent the strength or weight of each edge:

  f_ij ≡ frequency of transfers for i → j,
  g_ij ≡ total amount of transfers for i → j.

Regarding the time-scale T for aggregating the transactions and the epoch to select in the historical data, we choose one month and the calendar year 2019. By examining the time series of the daily number and amount of transactions, we judged that a period of one month is adequate to study the stability and temporal change of the crypto flow. A shorter period may lead to a trivial result for the stability and could be insufficient to detect temporal change, if any is present; a longer period would be misleading due to the non-equilibrium nature of the system. The year 2019 was chosen as the epoch because it contains no violent bubble or crash in the price.
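The aggregation of Fig. 3 amounts to a group-by over the transfer list; a sketch with hypothetical transfers (the window T is implicit in the choice of the list):

```python
from collections import defaultdict

# illustrative transfers within one window: (source, destination, amount in BTC)
transfers = [("i", "j", 0.5), ("i", "j", 1.2), ("i", "j", 0.3),
             ("j", "i", 0.8), ("i", "i", 2.0), ("i", "i", 0.1)]

f = defaultdict(int)    # f[i, j]: number of transfers i -> j
g = defaultdict(float)  # g[i, j]: total amount transferred i -> j
for s, d, amount in transfers:
    f[s, d] += 1
    g[s, d] += amount
```

The resulting f and g are exactly the edge weights f_ij and g_ij defined above, including self-loop entries such as f[i, i].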
The total number of users is huge, 315 million. Fortunately, however, it is not necessary to include all of them, because most of them do not appear frequently. In the next section, we shall extract only a tiny part of the network by focusing on the "regular users" who appeared every day during the specified period.

Regular Users as Big Players
For our purpose in this paper, it is sufficient to focus on the crypto flow with high frequency and a big amount of Bitcoin, because infrequent and/or small flows are obviously unimportant for understanding the entire flow. In other words, it suffices to focus on "big players" who play a dominant role in the game of crypto flow. One could define such a big player in different ways. In this study, we define it by looking at how persistently the user appears in transactions during our specified period of time. Fig. 4 depicts examples of users who appear in different numbers of transactions on a daily basis. The user in the top case persistently commits transactions with other users and can be labeled a "regular user". The middle case abruptly changes from being inactive, with no transactions with anyone else, to being active. The bottom case shows little activity, just a few transactions on particular days, with strong intermittency.
We define regular users as those appearing every day during the one-year period of 2019, and use them as the big players. The number of regular users was 479. We then construct a subgraph in each month, comprised of the regular users as nodes and the crypto flow as links, the latter aggregated as described in Fig. 3. Thus we have 12 snapshots of such subgraphs, each corresponding to a month of the year, from January to December. Summarizing the processing of the whole dataset, we constructed snapshots of networks denoted by G_t = (V_t, E_t), where t is the month, V_t is the set of regular users, and E_t is the set of links among them, each carrying the frequency and amount as depicted in Fig. 3.

Fig. 4: Illustrative examples of the activities of users. Each plot depicts the daily number of transactions in which the user is either the source or the destination, for the year 2019; self-loops (same source and destination) are excluded. Top: a "regular" user appearing every day. Middle: a user that became active after being inactive. Bottom: a user with intermittent activity. We focus on regular users in this paper.
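Selecting regular users is then a set intersection over the days of the period; a sketch with a hypothetical day-to-users mapping:

```python
# active[day] = set of user IDs appearing in at least one transfer that day
active = {
    "2019-01-01": {"u1", "u2", "u3"},
    "2019-01-02": {"u1", "u3"},
    "2019-01-03": {"u1", "u3", "u4"},
}

# regular users = those present on every single day of the period
days = iter(active.values())
regular = set(next(days))
for users in days:
    regular &= users
```

With the full 365-day mapping for 2019, this intersection yields the 479 regular users used in the paper.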
We add Fig. 5 showing the histogram for the UTC time of highest activities of all the regular users in the year of 2019. This information gives geographical locations of those regular users.

Basic Properties of Network
For each month t, we constructed a network G_t = (V_t, E_t) as described in the preceding section. Table I summarizes the basic properties of the networks in the year 2019. V_t corresponds to the regular users in each month, whose number is shown in the column |V_t|. The column |E_t| is the number of edges, namely the number of distinct crypto flows from one user to another or to itself (self-loops, shown in parentheses). Most of the users have self-loops, with the frequencies f_ii and the amounts g_ii highly correlated with the number of addresses identified in Section 2.1, as naturally expected. Because our main interest in this paper is the crypto flow from one user to another, we remove all the self-loops in what follows. Temporal change of the network causes changes in V_t and E_t. The column |V_t ∩ V_t+1| is the number of users common to successive months. One can see that most of the users appear in successive months. The same is true for the edges, as shown in the column |E_t ∩ E_t+1|. In other words, the network does not change drastically in terms of the entry and exit of nodes and edges on the time-scale of months.

Table I: Summary of basic properties of the networks in the year 2019, January to December.
Adjacency matrices with link strengths given by the frequencies f_ij of G_t for all t are illustrated in Appendix B. One can see that the overall picture does not change in time, but the illustration alone does not uncover the nature of the connectivity and flow.
To see the connectivity of the network, namely how those regular users are linked and where they are located in the stream of crypto flow, let us examine its connected components. First, decompose G_t into weakly connected components (WCC), i.e., connected components when the graph is regarded as undirected. We found that there exists a giant WCC (GWCC) containing most of the users; see the column GWCC of Table I. Only a small number of components are disconnected from it, as shown in the same column.
Then, in order to identify the location of the users contained in the GWCC, we employed the well-known analysis of "bow-tie" structure [23]. In general, the GWCC can be decomposed into the following parts:

GSCC Giant strongly connected component: the largest connected component when viewed as a directed graph. One or more directed paths exist between an arbitrary pair of users in the component.
IN The nodes from which the GSCC is reached via at least one directed path.
OUT The nodes that are reachable from the GSCC via at least one directed path.
TE "Tendrils": the rest of the GWCC.

It follows that GWCC = GSCC + IN + OUT + TE.
The GSCC is the core of the crypto flow's circulation, and the IN and OUT parts are the upstream and downstream of the flow, respectively. The users in IN play the role of suppliers of crypto, while the OUT users can be considered consumers of crypto. Table I shows the bow-tie structure in the columns GSCC/IN/OUT/TE. For example, in September, 470 users are located in GSCC (325 users), IN (23), OUT (113), and TE (5). One can observe that a large fraction of the users in the GWCC is located in the GSCC, which can be interpreted as those regular users circulating crypto globally. Smaller fractions of the users are in IN and OUT, with an asymmetry between their numbers.
It is interesting to see how individual users move within the temporal change of the network. Fig. 6 depicts the changes from each month to the next over the whole year. One can see that the memberships of GSCC, IN, and OUT are very stable. This means that users appearing in successive months play stable roles in the circulation of the crypto flow and in its upstream and downstream. We remark that the analysis of bow-tie structure is based on binary links, namely the presence or absence of links among nodes, not on the strength of links such as the frequency and amount of crypto flow. In the next section, we shall see how to quantify the location of users by using the so-called Hodge decomposition.
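The bow-tie parts can be computed from reachability alone. A self-contained sketch (brute-force BFS, adequate for a few hundred nodes; the toy edge list is illustrative, not from the data):

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """BFS: all nodes reachable from start, including start."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Decompose the GWCC of a directed graph into GSCC, IN, OUT, TE."""
    adj, radj, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for s, d in edges:
        adj[s].add(d); radj[d].add(s)
        und[s].add(d); und[d].add(s)
        nodes |= {s, d}
    # the SCC of u = nodes reachable both from u and to u
    sccs = {u: frozenset(reachable(adj, u) & reachable(radj, u)) for u in nodes}
    gscc = max(sccs.values(), key=len)
    rep = next(iter(gscc))
    out_part = reachable(adj, rep) - gscc   # reachable FROM the GSCC
    in_part = reachable(radj, rep) - gscc   # reaches the GSCC
    gwcc = reachable(und, rep)              # weak component of the GSCC
    te = gwcc - gscc - in_part - out_part   # tendrils: the rest
    return gscc, in_part, out_part, te

# toy graph: a feeds the core cycle b<->c, which drains into d; e is a tendril
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d"), ("a", "e")]
gscc, in_part, out_part, te = bow_tie(edges)
```

On the monthly snapshots G_t, the four returned sets give the GSCC/IN/OUT/TE columns of Table I directly.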

Hodge Decomposition
The Helmholtz-Hodge-Kodaira decomposition, or simply Hodge decomposition, is a combinatorial method to decompose flow on a network into a circulation and a gradient flow. The original idea dates back to the Helmholtz theorem in vector analysis, which states that under appropriate conditions any vector field can be uniquely represented as the sum of an irrotational (curl-free) vector field and a divergence-free (solenoidal) vector field. The theorem was generalized from Euclidean space to graphs and other objects by Hodge, Kodaira, and others; see [24][25][26] for readable expositions. The method has a wide range of applications, including neural networks [27], economic networks [28,29], and our previous work on Bitcoin [30].
We recapitulate the method briefly to keep the present manuscript self-contained. Let A_ij denote the adjacency matrix:

  A_ij = 1 if there is a link from i to j, and A_ij = 0 otherwise.   (5)

We excluded all self-loops, implying that A_ii = 0. Each link has a flow, denoted by F̃_ij, given by either the frequency f_ij or the amount g_ij of the transfer from i to j (see Fig. 3); define F̃_ij = 0 wherever A_ij = 0. Note that there can be a pair of users such that A_ij = A_ji = 1 and F̃_ij, F̃_ji > 0. Let us define a "net flow" F_ij by

  F_ij = F̃_ij − F̃_ji,   (6)

and a "net weight" w_ij by

  w_ij = (A_ij + A_ji) / 2.   (7)

Note that w_ij is symmetric, i.e., w_ij = w_ji, and non-negative, i.e., w_ij ≥ 0 for any pair of i and j. The Hodge decomposition is given by

  F_ij = F^(c)_ij + F^(g)_ij,   (8)

where the circular flow F^(c)_ij satisfies

  Σ_j F^(c)_ij = 0 for every i,   (9)

which implies that the circular flow is divergence-free. The gradient flow F^(g)_ij can be expressed as

  F^(g)_ij = w_ij (φ_i − φ_j).   (10)

Thus the weight w_ij serves to make the gradient flow possible only where a link exists. We refer to the quantity φ_i as the Hodge potential. A large value of φ_i implies that user i is in the upstream of the entire network, while a small value implies that i is in the downstream. Combining (8), (9), and (10), one can derive the following equation to determine φ_i:

  Σ_j L_ij φ_j = Σ_j F_ij   (11)
for i = 1, . . . , N. Here, L_ij is the so-called graph Laplacian, defined by

  L_ij = δ_ij Σ_k w_ik − w_ij,   (12)

where δ_ij is the Kronecker delta. It is easy to show that the matrix L = (L_ij) has only one zero mode (eigenvector with zero eigenvalue); this zero mode simply corresponds to the arbitrariness in the origin of φ. All the other eigenvalues are positive (see, e.g., [30]). Therefore, (11) can be solved for the potentials once the origin of φ is fixed; we assume that the average value of φ is zero.

Fig. 7 depicts the distributions of the Hodge potentials of the users in GSCC, IN, and OUT. The overall distribution is bimodal, with peaks at positive and negative values and a number of values around zero; these correspond to IN, OUT, and GSCC, located in the upstream, downstream, and core of the entire crypto flow, respectively. Moreover, there exists a correlation between the value of the Hodge potential and the net amount of demand or supply of crypto by each user. See [30] for details, where we studied a daily snapshot of the network including all the users, not only big players. We claim that the same property holds also for the monthly data restricted to the big players of regular users.
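Equation (11) can be solved numerically with the pseudoinverse of the graph Laplacian, which picks out the zero-mean potential automatically. A numpy sketch, assuming the symmetric net weight w_ij = (A_ij + A_ji)/2 common in the Hodge-decomposition literature, on a toy three-user chain:

```python
import numpy as np

def hodge_potential(F_tilde, A):
    """Hodge potentials phi from raw flows F_tilde and adjacency A."""
    F = F_tilde - F_tilde.T          # net flow F_ij
    w = (A + A.T) / 2.0              # symmetric net weight w_ij (assumption)
    L = np.diag(w.sum(axis=1)) - w   # graph Laplacian L_ij
    b = F.sum(axis=1)                # net out-flow (divergence) of each user
    # L has exactly one zero mode (constant shift of phi); the pseudoinverse
    # returns the minimum-norm solution, which has zero average potential
    return np.linalg.pinv(L) @ b

# toy chain: user 0 -> 1 -> 2, one unit of flow on each link
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
phi = hodge_potential(A, A)
# phi decreases downstream: phi[0] > phi[1] > phi[2]
```

For this chain there is no circulation, so the flow is purely gradient and the potentials fall monotonically from the supplier (user 0) to the consumer (user 2).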

Non-negative Matrix Factorization
It is natural to ask whether there are distinctive ingredients within the crypto flow. The analysis of bow-tie structure is based merely on the binary relationship of links, so it does not give such information: the crypto simply flows from upstream to downstream, with circulation in the giant strongly connected component that occupies a large fraction of the entire network. In other words, are there any "principal components" that constitute the entire flow in a decomposition? In order to find such principal components, or latent factors, in the transfer of crypto among big players, we shall apply non-negative matrix factorization (NMF) to the strength of links, namely the matrix of the frequencies and amounts of transfer. We recapitulate the method here; see [31][32][33] and references therein for an introduction.
Let X be an N × M non-negative matrix; that is, its elements are all non-negative, denoted as X ≥ 0. NMF gives an approximation of X by a product of two matrices:

  X ≈ S D,   (13)

where S and D are N × K and K × M non-negative matrices, S, D ≥ 0, respectively. In practice, one expects K to be much smaller than N and M, so that the factorization gives a compact representation of X. We shall assume that N = M for our application to the crypto flow among N users in what follows. Explicitly in components, (13) reads

  X_sd ≈ Σ_k S_sk D_kd,   (14)

where the indices s and d represent source and destination (s, d = 1, . . . , N), respectively, and X_sd is the strength of the crypto flow from s to d in a certain period of time, quantified by the frequency f_sd, the amount g_sd, or similar variables. We choose

  X_sd = f_sd   (15)

in this paper. See Fig. B·1 in Appendix B for the illustration of X_sd. We would expect that K ≪ N because of the sparsity of X; how to determine K is discussed later. The approximation in (13) is actually given by the following optimization:

  min_{S, D ≥ 0} F(X, S D),   (16)

where the function F(·, ·) is the so-called Kullback-Leibler (KL) divergence defined by

  F(A, B) = Σ_{i,j} ( A_ij log(A_ij / B_ij) − A_ij + B_ij ).   (17)

Note that F(A, B) = 0 if and only if A = B. The reason why we choose this particular function will be clarified later. Technically, one can solve (16) iteratively, initializing S and D with non-negative double singular value decomposition (see the review [32] and references therein). Although the iterative algorithm yields local minima, our numerical solutions under different random seeds gave essentially the same decomposition.
To understand the meaning of the decomposition, let us consider how a source distributes flow to different destinations. For an arbitrary source s, (14) can be written as

  X_s ≈ Σ_k S_sk D_k,   (18)

where X_s is the s-th row vector of X, and D_k is the k-th row vector of D. Equation (18) means that the flow from the source s can be expanded in terms of "basis" vectors D_k (k = 1, . . . , K). The components (D_k)_d = D_kd represent how destinations are distributed among users in the k-th NMF component. It is convenient to normalize D_k by its L1-norm ‖D_k‖ ≡ Σ_d D_kd, defining

  D̃_kd = D_kd / ‖D_k‖,   (19)

so that one has

  Σ_d D̃_kd = 1   (20)

for all k. With respect to these normalized basis vectors, the expansion in (18) is rewritten as

  X_s ≈ Σ_k ( S_sk ‖D_k‖ ) D̃_k.   (21)

Thus the outgoing flow from the source s is approximately expressed by a linear combination of the K normalized basis vectors D̃_k with coefficients S_sk ‖D_k‖.

Similarly, consider how a destination d collects flow from different sources. For an arbitrary destination d, (14) reads

  X^d ≈ Σ_k S_k D_kd,   (22)

where X^d is the d-th column vector of X, and S_k is the k-th column vector of S. The components (S_k)_s = S_sk represent how sources are distributed among users in the k-th NMF component. Define ‖S_k‖ ≡ Σ_s S_sk and

  S̃_sk = S_sk / ‖S_k‖,   (23)

so that one has

  Σ_s S̃_sk = 1   (24)

for all k. Then (22) is rewritten as

  X^d ≈ Σ_k ( D_kd ‖S_k‖ ) S̃_k.   (25)

Thus the incoming flow to the destination d is approximately expressed by a linear combination of the K normalized basis vectors S̃_k with coefficients D_kd ‖S_k‖.

How can one determine K? Obviously, the larger K is, the better the approximation (13) is, but at the cost of a less parsimonious representation of the data. In the next section, let us make a detour to examine this issue from a different perspective.
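As one concrete implementation (not necessarily the one used for the paper's results), scikit-learn's NMF supports the KL objective of (16)-(17) with multiplicative updates and NNDSVD-type initialization; random Poisson counts stand in for the flow matrix X here:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 60)).astype(float)  # stand-in for f_sd

K = 5
model = NMF(n_components=K, init="nndsvda", solver="mu",
            beta_loss="kullback-leibler", max_iter=500, random_state=0)
S = model.fit_transform(X)   # N x K, non-negative
D = model.components_        # K x N, non-negative
X_approx = S @ D             # the factorized approximation of (13)
```

The multiplicative-update solver ("mu") is the one required for the KL loss in scikit-learn, and "nndsvda" is the zero-filled variant of the NNDSVD initialization mentioned above.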

NMF as a probabilistic model
We can interpret NMF as a probabilistic model. Denote the right-hand side of (14) by

  ξ_sd ≡ Σ_k S_sk D_kd,   (26)

regarded as parameters to be estimated from the data X, where each X_sd is assumed to be a random number drawn from a Poisson distribution with parameter ξ_sd:

  P(X_sd | ξ_sd) = e^(−ξ_sd) ξ_sd^(X_sd) / X_sd!.   (27)

It is easy to see that the log-likelihood function L(ξ_sd) ≡ log P(X_sd | ξ_sd) takes its maximum value at ξ_sd = X_sd. One can then introduce a quantity measuring how good the estimation of the parameters is, namely

  Σ_{s,d} [ L(X_sd) − L(ξ_sd) ],   (28)

which is to be minimized. One can see that this quantity is equivalent to the KL divergence in (17). To express the entire framework in probabilistic terms more explicitly, let us normalize the data X in (15) by

  x_sd = X_sd / N_f,  N_f ≡ Σ_{s,d} X_sd.   (29)

Then let us rewrite (14) as

  x_sd ≈ Σ_k r_k S̃_sk D̃_kd,   (30)

where D̃_kd and S̃_sk were given by (19) and (23) respectively, and

  r_k ≡ ‖S_k‖ ‖D_k‖ / N_f,   (31)

which satisfies Σ_k r_k = 1. Let us denote the right-hand side of (30) by

  p_sd ≡ Σ_k r_k S̃_sk D̃_kd,   (32)

which satisfies Σ_{s,d} p_sd = 1. We remark that the normalized weight r_k defined by (31) gives the relative importance of the k-th NMF component in the expansion with normalized basis vectors in (32); one can thus order the NMF components uniquely according to the magnitudes of r_k. Suppose that there are N_f transfers in total during a period of time. For each pair of source and destination, s and d, generate a transfer s → d with probability p_sd, independently of the other pairs. Under the assumption of a small probability p_sd and a large number N_f, X_sd follows a Poisson distribution with parameter ξ_sd = N_f p_sd.
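The normalized bases and component weights r_k of (29)-(32) follow directly from any non-negative factors S and D; a sketch with hypothetical small factors:

```python
import numpy as np

def normalize_components(S, D):
    """Split S, D into L1-normalized bases and weights r_k, as in (29)-(32)."""
    s_norm = S.sum(axis=0)            # ||S_k||: column sums of S
    d_norm = D.sum(axis=1)            # ||D_k||: row sums of D
    S_tilde = S / s_norm              # each column now sums to one
    D_tilde = D / d_norm[:, None]     # each row now sums to one
    N_f = (s_norm * d_norm).sum()     # approximates the total flow sum(X)
    r = s_norm * d_norm / N_f         # component weights, summing to one
    order = np.argsort(r)[::-1]       # rank components by weight r_k
    return S_tilde[:, order], D_tilde[order], r[order]

rng = np.random.default_rng(1)
S_tilde, D_tilde, r = normalize_components(rng.random((8, 3)), rng.random((3, 8)))
```

The returned r gives the descending ordering used to rank NMF components in the figures below.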
It turns out that the decomposition in (26), or equivalently (32), has an interesting connection with machine learning. In natural language processing, it is often necessary to extract topics from documents comprising words or terms. In an unsupervised-learning setting, the task is to infer topics as hidden or latent variables that can explain a collection of documents, each being an unordered set of terms. Probabilistic latent semantic analysis (PLSA) is a probabilistic model for this task [36]. Suppose that there are N documents and M terms. Then the occurrence of terms can be expressed by a document-term matrix X of size N × M, each element of which is the frequency of occurrence of a term in a document. Topics are latent variables that explain the data X. A topic is actually a probability distribution over the occurrence of terms, and a document can have a mixture of topics. An example is a document on "the influence of hosting the Olympics on the economy", with a mixture of topics on sports and the economy.
One of the widely used models of PLSA is latent Dirichlet allocation (LDA); see Appendix C and references therein. For our purpose, it suffices to understand how terms are generated at locations in documents in a probabilistic way. The probability that a term is chosen at a location in a document is given by a sum of K factors, each of which is the product of two probabilities: the probability that a topic is selected in the document, and the probability that the term is chosen under the selected topic. See equation (C·11) in Appendix C. One can immediately see that (C·11) is essentially the same as (26), or equivalently (32).

Fig. 8: (a) Measures of coherence for the three methods [38][39][40] are compared. Each measure is drawn on the vertical axis so that it is to be minimized to find the optimal number of components; maximum and minimum values over the range of K are scaled to 1 and 0, respectively, to ease the comparison. One can see that K = 11 to 13 is optimal. (b) Monte Carlo simulations by the method of [39] with 20 runs for each K. Averages (points) and the 99% level (narrow gray band) calculated from standard errors are drawn. We conclude from (b) that K = 13 is optimal.
Thus the matrix decomposition of NMF can be placed in the framework of the probabilistic models PLSA and LDA. As a bonus, one can adopt the methods for estimating the number of topics for our problem of determining the number of NMF components, denoted by K in both cases. Interested readers may consult the literature [37][38][39][40] and the references given at the end of Appendix C. Let us now finish this detour and look at our results in the next section.

Result of NMF for Crypto Flow
We first show a few results for the snapshot of September 2019 (denoted as 2019-09), in order to verify whether the idea in the preceding section works for determining the number of NMF components. Fig. 8 shows the measures of coherence obtained by three different LDA-based methods [38][39][40]. The methods consistently give mostly the same optimal values, K = 11 to 13, as shown in Fig. 8 (a). We found the measure given in [39] to be relatively stable and useful for pinning down a specific value of K, so we performed the Monte Carlo simulations shown in Fig. 8 (b) and determined the optimal value as K = 13. For these data, X has dimension N = 470, so we conclude that a small number of NMF components can explain the entire flow among those regular users.
In Appendix D, we summarize the same analysis for the data of all the other months of the year 2019. We found that the optimal number K is quite small, in the range between 10 and 20, much smaller than the number of users, N ∼ 500 (see Table I). Additionally, K is relatively stable irrespective of the temporal change. See Table D·1 and Fig. D·1.
Let us examine each NMF component obtained with the optimal value of K. Fig. 9 and Fig. 10 show the NMF components in terms of the basis vectors D̃_k and S̃_k, respectively, for k = 1, . . . , K. In Fig. 9, each plot shows the vector components (D̃_k)_d = D̃_kd, i.e., how destinations are distributed among users d in the k-th NMF component. Similarly, in Fig. 10, each plot shows the vector components (S̃_k)_s = S̃_sk, i.e., how sources are distributed among users s in the k-th NMF component. See (19) and (23), and note the normalization therein. In each of Fig. 9 and Fig. 10, the plots are ordered from top to bottom in descending order of the probability r_k given in (31).
One can immediately notice from the figures that the components of these basis vectors are concentrated on a limited number of users rather than distributed among many users. To quantify the effective number of this concentration, let us use the inverse Herfindahl-Hirschman index, abbreviated as IHH, defined as follows. Consider "shares" x_i ≥ 0 among i = 1, . . . , N items with the sum equal to 1, i.e., Σ_i x_i = 1. The IHH is defined by

  IHH = 1 / Σ_i x_i².   (33)

When the shares are equal, x_i = 1/N for all i, then IHH = N. On the other hand, under the strongest concentration, namely x_i = 1 for one particular i and x_i = 0 otherwise, IHH = 1. Thus the IHH gives an estimate of the effective number of large shares. The idea can be applied to the basis vectors, because they are normalized in the same way as shares. In Fig. 9 and Fig. 10, we display all the calculated IHHs. One can see that the IHHs are quite small, ranging from a few to a dozen or so, compared with the total number of users N = 470 for the data 2019-09 (see Table I).
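The IHH is a one-liner; a minimal sketch with the two limiting cases described above:

```python
import numpy as np

def ihh(x):
    """Inverse Herfindahl-Hirschman index of a share vector summing to one."""
    x = np.asarray(x, dtype=float)
    return 1.0 / np.sum(x ** 2)

equal = ihh(np.full(10, 0.1))          # equal shares among 10 -> IHH = 10
concentrated = ihh([1.0, 0.0, 0.0])    # full concentration -> IHH = 1
```

Applied to each normalized basis vector D̃_k or S̃_k, this yields the effective number of users carrying that component.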
How can we use the NMF components to understand the crypto flow? Choose a particular user s as a source. The flow from s is approximately expressed by a linear combination of the K normalized basis vectors D̃_k, each depicted in Fig. 9, with the coefficients S_sk ‖D_k‖ given in (21). The coefficients represent the strengths of the decomposed flows from the source s. A similar argument holds when choosing a particular user d as a destination: the flow to d is expressed by the linear combination (25) with coefficients D_kd ‖S_k‖.
For example, consider the user with ID 0000000000, located in the GSCC, as a source s. Fig. 11 (a) shows the coefficients corresponding to the K components. One can see that the coefficients are non-zero at only four components. Such sparseness tells us that the flow from this user can be expressed with a few components. Moreover, the corresponding components have non-zero values at only a small number (recall the IHHs) of the vector components D_kd, as shown in Fig. 9, implying that the users corresponding to these non-zero components constitute a cluster for the outgoing flow from the source s.
The same user 0000000000 can also be regarded as a destination d in the GSCC. Fig. 11 (b) shows the coefficients, again non-zero at only one or two components. Together with Fig. 10, one can find another cluster composed of a small number of users for the incoming flow to d. Similar arguments hold for the users 0000006178 and 0000000012, located in the IN and OUT components, respectively; see Fig. 11 (c) and (d). In this way, one can find clusters for either or both of the outgoing and incoming flows of each user.
Each NMF component can be represented by a matrix, because the non-negative matrix X_sd, or its probabilistic counterpart p_sd, can be expressed by (18) or (32). It is thus possible to depict each component k by the matrix S_sk D_kd in normalized form. Fig. 12 and Fig. 13 illustrate such matrices for the data of 2019-09; these should be compared with X_sd in Appendix B. One can see that the NMF components provide sparse matrices.
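Each such component matrix S_sk D_kd (k fixed) is a rank-one outer product, and the unnormalized components sum back to the full factorization. A minimal sketch, with made-up factors S and D rather than the paper's:

```python
import numpy as np

def component_matrix(S, D, k):
    """Rank-one matrix of NMF component k, normalized to sum to 1."""
    M = np.outer(S[:, k], D[k, :])
    return M / M.sum()

# Made-up non-negative factors: 6 sources, 8 destinations, K = 3.
rng = np.random.default_rng(1)
S = rng.random((6, 3))
D = rng.random((3, 8))

# The unnormalized components sum to the full product S @ D.
total = sum(np.outer(S[:, k], D[k, :]) for k in range(3))
assert np.allclose(total, S @ D)
```

Because each component is rank one, its matrix is non-zero only on the rows and columns where S_sk and D_kd are non-zero, which is why the depicted matrices are sparse.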
Finally, we find that the NMF components are relatively stable under the temporal change of the network. Fig. 14 (a) shows the cosine similarities of the NMF basis vectors D_k for the two successive months 2019-09 and 2019-10, and Fig. 14 (b) shows the corresponding result for S_k. In both results, one can see that the NMF components are quite similar, up to only a few permutations of indices. (Fig. 11 caption: The user of (c) is a source, and the user of (d) is a destination. The expansion is given by (21) for the selected source s, and by (25) for the selected destination d. Data: 2019-09.)
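The month-to-month comparison amounts to pairwise cosine similarities between the two sets of basis vectors; the following is a minimal sketch on toy vectors, not the paper's data:

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarities between rows of A and rows of B.

    A, B: (K, N) arrays of basis vectors for two successive months.
    An entry [i, j] near 1 means component i at time t matches
    component j at time t + 1.
    """
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return An @ Bn.T

# A pure relabeling of components shows up as a permuted identity:
A = np.eye(3)
B = A[[1, 0, 2]]          # components 0 and 1 swapped at t + 1
C = cosine_similarity_matrix(A, B)
```

A matrix close to a permuted identity, as in Fig. 14, indicates that the components themselves persist and only their indices are shuffled.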

Discussions
Let us briefly discuss several aspects that would be worth further investigation. First, while we succeeded in extracting the NMF components and found that they have non-zero values at only a relatively small number of users, we have not yet exploited this fact to identify those users. It is quite likely that the extracted users play important roles in each of the NMF components, either as key destinations or as key sources. We attempted to identify a small fraction of such users by matching them against the identities given in Appendix A, but the identification was not sufficient to interpret the meaning of the corresponding NMF components. Instead, intra-day activities such as those shown in Fig. 2 of Section 2.1 could give us the geographical locations of those key users, possibly uncovering the crypto flow in each NMF component at a global scale. This issue remains to be investigated.
Second, even though the temporal change of the network in terms of the NMF components has the stable structure found in Fig. 14, we noticed an interesting change of a few components in the same figure. A keen reader may have noticed that the components k = 3, 4, 5, 6 at time t are permuted among themselves at time t + 1, while the cosine similarities remain close to 1. This means that the probabilities r_k for those components changed from one month to the next. One can also notice that the optimal number of NMF components varied slowly during the period (recall Table D·1 of Appendix D). These facts might give us a hint on how to treat the temporal change of the network by paying attention to those slowly varying aspects. Additionally, while we focused only on regular users appearing every day during the period under study, it would be necessary to include the process of entry and exit of big players.
Third, technically, we regarded the method of NMF as a probabilistic model that shares the same stochastic process as the probabilistic latent semantic analysis (PLSA). As a bonus, we were able to employ the latent Dirichlet allocation (LDA) and its known methods to estimate the number of topics in the context of topic models, or the number of NMF components in our context. In principle, one could start with the full-fledged Bayesian framework of the LDA and its extensions and variations. It would be worth pursuing this direction, which is also related to the second point above, because there are several studies on how to treat temporally changing topics of documents over a long time-span.
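A generic sketch of such a model-selection scan, using scikit-learn's LatentDirichletAllocation with perplexity as the score (an assumption for illustration; the paper's exact criterion is given in Appendix D) on toy count data:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy integer count matrix standing in for the flow matrix X[s, d].
X = rng.poisson(2.0, size=(40, 30))

# Scan candidate numbers of components and score each fitted model;
# lower perplexity is better under this criterion.
scores = {}
for K in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=K, random_state=0)
    lda.fit(X)
    scores[K] = lda.perplexity(X)

best_K = min(scores, key=scores.get)
```

In practice one would score held-out data rather than the training matrix, since training perplexity tends to keep improving as K grows.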
Fourth, our methods in this paper can be easily applied to different cryptoassets, including Ethereum and XRP. We are aware of the paper [20] in this volume, which follows a similar line of study. It would be interesting to apply our methods to the data of XRP.
Finally, how the crypto flow among big players is related to the prices in the exchange markets with fiat currencies, and also with other cryptoassets, would be an extremely interesting problem. It is quite likely that bubbles/crashes and their precursors force the big players to react during such turmoils in a way different from tranquil periods. For example, exchanges need to reallocate cryptoassets when building a reservoir of cryptoassets or releasing them under risk.

Summary
Our purpose in this study of the Bitcoin cryptoasset is to understand the structure and temporal change of crypto flow among big players. We compiled all the transactions contained in the blockchain of Bitcoin from its genesis to the year 2020, identified users from anonymous addresses, and constructed snapshots of networks comprising users as nodes and crypto flows among the users as links. While the whole network is huge, we extracted sub-networks by focusing on regular users who appeared persistently during a certain period. Specifically, we extracted monthly snapshots during the year 2019 and selected roughly 500 regular users.
We first analyzed the bow-tie structure from the binary relationship of flow, and then performed the Hodge decomposition based on the strength of flow defined by frequencies and amounts, in order to locate users in the upstream, downstream, and core of the entire crypto flow. We found that the bow-tie structure is stable during the period, implying that those regular users retain their respective roles in the crypto flow.
Then, to reveal important ingredients hidden in the flow, we employed the method of non-negative matrix factorization (NMF) to extract a set of principal components. We argued that the NMF method can be regarded as a probabilistic model, equivalent to a probabilistic latent semantic analysis and its typical model of latent Dirichlet allocation. This observation gave us a method to estimate an optimal number of NMF components, which turned out to be a dozen or so. We found that the NMF components have non-zero values for only a limited number of users, telling us their roles as destinations or sources of the crypto flow. Additionally, we found that the NMF components are quite stable under temporal change on the time-scale of months.
Several points remain for future work, including further investigation of the users contained in those NMF components, a treatment of the temporally changing network, and other technically interesting issues.

Appendix A: Identity of Users of Type A
WalletExplorer.com [19] is a web site providing information about the identity of addresses in the Bitcoin blockchain. The site merges addresses together if they are part of the same wallet, and also identifies wallets with actual names. According to the site, the method to merge addresses is as follows: "Just a basic algorithm is used to determine wallet addresses. Addresses are merged together, if they are co-spent in one transaction. So if addresses A and B are co-spent in transaction T1, and addresses B and C are co-spent in transaction T2, all addresses A, B and C will be part of one wallet. Sometimes, an address belongs to some service but it was never co-spent with others. Then that address stays unnamed. It is typically more often at addresses with higher amount (as there is no need to co-spending)."
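The transitive merging rule quoted above is exactly a union-find (disjoint-set) computation over the input addresses of each transaction; the following is a minimal sketch with hypothetical transactions:

```python
# Union-find over co-spent addresses: addresses that appear together
# as inputs of one transaction are merged into one wallet, and the
# merging is transitive (A-B in T1 and B-C in T2 puts A, B, C
# in the same wallet). The transactions below are hypothetical.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, a):
        self.parent.setdefault(a, a)
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
transactions = [["A", "B"], ["B", "C"], ["D"]]  # input address lists
for inputs in transactions:
    for addr in inputs[1:]:
        uf.union(inputs[0], addr)

# A, B, C end up in the same wallet; D stays separate.
print(uf.find("A") == uf.find("C"), uf.find("A") == uf.find("D"))
# prints: True False
```

Each resulting root represents one wallet; an address never co-spent with others forms a singleton set, matching the "stays unnamed" case described above.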
This method is precisely the same as that of [1], which is the one we employed in Section 2.1. In addition, the identification of actual names is done by WalletExplorer.com as follows: "In most of the cases, I registered to service, made transaction(s) and saw which wallet bitcoins were merged with, or from which wallet it was withdrawn. There is probably no easier way how to discover names other than this. Please note that the name database is not updated, so it does not contain newer exchanges (or newer wallets of existing exchanges)."
We matched our data with that of [19] to obtain the identity and additional attributes of users of type A (see Section 2.1 for the type). Table A·1 shows the classification into exchanges, services, gambling, historic, and mining pools. Table A·2 shows the list of countries to which the exchanges belong. Table A·3 is the complete list of this matching.