The emergence of Bitcoin in 2008 revolutionized digital finance by introducing a decentralized system for storing and transferring value without reliance on central authorities. As interest in blockchain analytics grows, so does the need for comprehensive, accessible datasets to support advanced research. This article presents a large-scale, publicly available Bitcoin transaction graph dataset designed to empower researchers, data scientists, and developers in exploring the network’s dynamics, entity behaviors, and financial flows.
Spanning nearly 13 years of blockchain activity, this dataset includes 252 million nodes and 785 million directed edges, representing real-world entities and their value transfers. With rich temporal attributes, node labels, and curated metadata, it stands as the most extensive public resource of its kind—enabling deep analysis beyond what previous datasets like Elliptic could offer.
Understanding the Bitcoin Ecosystem
Bitcoin operates on a transparent, immutable ledger known as the blockchain. Every transaction is publicly recorded, creating an open financial ecosystem where value moves peer-to-peer through cryptographic verification. While this transparency fosters trust and auditability, it also opens doors to sophisticated analytical methods—especially when structured into graph-based representations.
In this ecosystem:
- Transactions transform inputs (spent outputs) into new outputs.
- Unspent Transaction Outputs (UTXOs) represent spendable balances locked under scripts.
- Addresses serve as pseudonyms derived from private keys.
Despite full data availability, extracting meaningful insights remains challenging due to the complexity of linking addresses to real-world actors and modeling interactions over time.
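The transaction and UTXO concepts listed above can be illustrated with a minimal sketch of a UTXO set. The names `TxOut`, `apply_transaction`, and `balance` are hypothetical and chosen for illustration; signature checks and script evaluation, which real Bitcoin validation requires, are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxOut:
    address: str    # pseudonymous address (stand-in for a locking script)
    value_sat: int  # amount in satoshis


def apply_transaction(utxos, txid, inputs, outputs):
    """Spend the referenced outputs and add the new ones to the UTXO set.

    utxos   : dict mapping (txid, output_index) -> TxOut
    inputs  : list of (txid, output_index) references being spent
    outputs : list of TxOut created by this transaction
    Returns the implied fee in satoshis (inputs minus outputs).
    """
    if len(set(inputs)) != len(inputs):
        raise ValueError("an output cannot be spent twice")
    missing = [ref for ref in inputs if ref not in utxos]
    if missing:
        raise ValueError(f"inputs are not spendable: {missing}")
    spent = sum(utxos[ref].value_sat for ref in inputs)
    created = sum(out.value_sat for out in outputs)
    if created > spent:
        raise ValueError("outputs exceed inputs")
    # Only mutate the set once every check has passed.
    for ref in inputs:
        del utxos[ref]
    for index, out in enumerate(outputs):
        utxos[(txid, index)] = out
    return spent - created


def balance(utxos, address):
    """Sum of all unspent outputs currently locked to one address."""
    return sum(o.value_sat for o in utxos.values() if o.address == address)
```

In this model an address has no stored balance; its balance is simply the sum of the unspent outputs that reference it.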
The Need for Advanced Transaction Graph Datasets
Existing Bitcoin datasets often focus on labeled address lists or small-scale graphs, placing the burden of preprocessing on researchers. For example:
- Elliptic datasets provide only ~200K nodes with binary labels (licit/illicit), limiting scope.
- Most public resources lack temporal depth, entity-level clustering, or diverse labeling.
These limitations hinder progress in areas like fraud detection, economic modeling, and behavioral analysis.
To bridge this gap, we introduce a large-scale temporal transaction graph that:
- Represents real entities (individuals, exchanges, miners, etc.) as nodes.
- Models directed value flows as timestamped edges.
- Includes over 34,000 labeled nodes across 11 entity types.
- Integrates off-chain data from forums like Bitcointalk to enrich labeling accuracy.
This dataset enables not only classification tasks but also longitudinal studies of network evolution and cross-sector interaction patterns.
Dataset Construction: From Blockchain to Graph
Raw Data Extraction
All transaction data was extracted from the first 700,000 blocks of the Bitcoin blockchain using a self-hosted Bitcoin Core node. The raw ledger was parsed to reconstruct every transaction, preserving timestamps (via block index), input/output structures, and script details.
This process yielded over 670 million transactions, forming the foundation of the graph.
Node Definition: Clustering Scripts into Real Entities
Rather than treating individual addresses as nodes—which inflates noise—we applied advanced heuristics to cluster scripts likely controlled by the same entity.
Key steps:
- Identify UTXOs protected by identical or related locking scripts.
- Apply behavioral clustering rules based on co-spending patterns (e.g., multiple inputs from different addresses in one transaction).
- Assign each cluster a unique identifier (alias), representing a single economic actor.
This method reduced over 874 million scripts into approximately 252 million entity clusters, significantly improving analytical clarity.
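The co-spending rule in the steps above is commonly implemented with a union-find (disjoint-set) structure. The sketch below assumes only that rule: scripts appearing together as inputs of one transaction are merged into one cluster. The function name `cluster_scripts` and the integer aliases are illustrative and do not reproduce the full set of heuristics used to build the dataset.

```python
class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited element at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_scripts(transactions):
    """Map each input script to an integer cluster alias.

    transactions: iterable of lists, each holding the scripts spent
    as inputs by one transaction.
    """
    uf = UnionFind()
    for input_scripts in transactions:
        if not input_scripts:
            continue
        first = input_scripts[0]
        uf.find(first)  # register scripts that are only ever spent alone
        for script in input_scripts[1:]:
            uf.union(first, script)

    aliases = {}
    clusters = {}
    for script in list(uf.parent):
        root = uf.find(script)
        clusters[script] = aliases.setdefault(root, len(aliases))
    return clusters
```

Because merges are transitive, two scripts that never appear in the same transaction still end up in one cluster if a third script was co-spent with each of them.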
Edge Formation: Modeling Value Transfer
An edge is created from sender to recipient when value is transferred between clusters in a transaction. The amount sent is calculated as:
Value transmitted = (Proportion of input contributed) × (Net output received)
Edges are:
- Directed: Reflecting flow direction.
- Timestamped: Using block height as a proxy for time.
- Weighted: By transaction volume in BTC and USD equivalents.
This structure allows for dynamic network analysis across time slices and behavioral segmentation.
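To make the value formula above concrete, here is a small sketch that turns one transaction into weighted edges. It rests on two assumptions that the text does not spell out: "proportion of input contributed" is taken as a cluster's input divided by the transaction's total input, and "net output received" is taken as a cluster's output minus whatever it contributed as input, so that change returned to a sender creates no edge. The function name and tuple layout are hypothetical, and USD weights are omitted.

```python
def transaction_edges(inputs_btc, outputs_btc, block_height):
    """Split one transaction into (sender, recipient, value_btc, block_height) edges.

    inputs_btc  : dict cluster -> BTC contributed as inputs
    outputs_btc : dict cluster -> BTC received as outputs
    """
    total_in = sum(inputs_btc.values())
    if total_in <= 0:
        # Coinbase-style transactions have no sending cluster.
        return []

    edges = []
    for recipient, received in outputs_btc.items():
        # Assumption: subtract the recipient's own inputs (change handling).
        net_received = received - inputs_btc.get(recipient, 0.0)
        if net_received <= 0:
            continue
        for sender, contributed in inputs_btc.items():
            if sender == recipient:
                continue
            share = contributed / total_in
            edges.append((sender, recipient, share * net_received, block_height))
    return edges
```

For example, if cluster A contributes 3 BTC and cluster B contributes 1 BTC, and cluster C receives 2 BTC, the sketch attributes 0.75 × 2 = 1.5 BTC to the edge A→C and 0.25 × 2 = 0.5 BTC to the edge B→C.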
Handling Special Transactions
Not all transactions reflect genuine economic exchange. Two key types were excluded to preserve data integrity:
CoinJoin Transactions
Designed for privacy, CoinJoins merge multiple users’ transactions, obscuring fund origins. They undermine clustering heuristics and complicate flow tracing. We used pattern-matching heuristics from prior research to detect and exclude them.
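The exact detection rules are not given here, so the following is only a simplified illustration of what a pattern-matching check can look like. It assumes one commonly cited CoinJoin signature: several inputs combined with several outputs of identical value. The function name and the `min_participants` threshold are hypothetical, and the heuristics actually used for the dataset may differ.

```python
from collections import Counter


def looks_like_coinjoin(n_inputs, output_values_sat, min_participants=3):
    """Flag a transaction whose outputs repeat one value many times.

    n_inputs          : number of inputs in the transaction
    output_values_sat : list of output amounts in satoshis
    min_participants  : assumed minimum number of mixing participants
    """
    if n_inputs < min_participants or not output_values_sat:
        return False
    # Frequency of the most common output amount.
    _, frequency = Counter(output_values_sat).most_common(1)[0]
    return frequency >= min_participants
```

A check this coarse will produce false positives (for example batched payouts of equal size), which is why published heuristics add further conditions.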
Colored Coin Transactions
These embed non-Bitcoin assets (e.g., tokens or real-world assets) within scripts. Detected via known protocols (Open Asset, Omni Layer), they were removed to maintain focus on native BTC flows.
Node Labeling: Bridging On-Chain and Off-Chain Data
Accurate labeling is crucial for supervised learning. We combined multiple sources to assign entity types to clusters:
Primary Source: Bitcointalk Forum Analysis
We collected 14 million posts from Bitcointalk—the largest Bitcoin community forum—to extract contextual references to addresses. Using ChatGPT (gpt-4o-mini) via API calls, we analyzed message content alongside transaction IDs and USD amounts (converted using historical BTC/USD rates) to infer entity identities.
Examples include:
- Users reporting deposit issues → links address to exchange.
- Withdrawal discussions → identifies service-owned hot wallets.
- Public donation campaigns → ties addresses to individuals or projects.
Supplementary Label Sources
To expand coverage:
- Exchange addresses from CoinMarketCap and DeFiLlama.
- Ransomware wallets from academic datasets (Padua, Montréal).
- Sanctioned entities (e.g., Suex, Hydra) from U.S. SDN lists.
- Mining pools via message patterns in coinbase transactions.
- Betting platforms using URL regex detection on forum links.
These efforts produced a labeled set of 101,186 addresses, mapped to script clusters and ultimately to graph nodes.
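Mapping labeled addresses to clusters can be sketched as a simple lookup followed by a vote. The article does not say how conflicting labels inside one cluster are resolved, so the majority vote below is an assumption, as are the function name and the dictionary inputs.

```python
from collections import Counter, defaultdict


def label_clusters(address_labels, address_to_cluster):
    """Propagate address-level labels to cluster-level labels.

    address_labels     : dict address -> entity type (e.g. "exchange")
    address_to_cluster : dict address -> cluster alias
    Returns dict cluster alias -> most frequent label among its addresses.
    """
    votes = defaultdict(Counter)
    for address, label in address_labels.items():
        cluster = address_to_cluster.get(address)
        if cluster is None:
            continue  # labeled address that never appears in the graph
        votes[cluster][label] += 1

    # Assumption: resolve disagreements by majority vote.
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}
```

Since one cluster can contain many labeled addresses, the number of labeled nodes is expected to be smaller than the number of labeled addresses.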
Entity Types Identified
- Individual
- Mining
- Exchange
- Marketplace
- Gambling
- Bet
- Faucet
- Mixer
- Ponzi
- Ransomware
- Bridge
Technical Validation: Predicting Entity Types with GNNs
To validate dataset quality, we trained several models to predict node labels using both structural and feature-based data.
Models Evaluated
Graph Neural Networks (GNNs):
- GCN
- GraphSage
- GAT
- GIN
Baseline for tabular features:
- Gradient Boosting Classifier (GBC)
All models used node features like:
- Total in/out transactions
- Average transaction size (BTC and USD)
- Degree centrality
- Account age (in blocks)
- Temporal activity rates
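As a rough sketch of how features like these can be derived from an edge list, the function below computes a subset of them per node. The edge tuple layout, the function name, and the exact definitions (age measured from first appearance to a chosen snapshot height, activity rate as transactions per block of age, raw rather than normalized degree) are assumptions for illustration. USD features are omitted.

```python
from collections import defaultdict


def node_features(edges, snapshot_height):
    """Compute simple per-node features.

    edges: iterable of (sender, recipient, value_btc, block_height)
    snapshot_height: block height at which account age is measured
    """
    n_in = defaultdict(int)
    n_out = defaultdict(int)
    total_btc = defaultdict(float)
    neighbors = defaultdict(set)
    first_seen = {}

    for sender, recipient, value_btc, block_height in edges:
        n_out[sender] += 1
        n_in[recipient] += 1
        for node, other in ((sender, recipient), (recipient, sender)):
            total_btc[node] += value_btc
            neighbors[node].add(other)
            first_seen[node] = min(first_seen.get(node, block_height), block_height)

    features = {}
    for node, first_height in first_seen.items():
        n_tx = n_in[node] + n_out[node]
        age_blocks = snapshot_height - first_height
        features[node] = {
            "n_in": n_in[node],
            "n_out": n_out[node],
            "avg_btc": total_btc[node] / n_tx,
            "degree": len(neighbors[node]),        # distinct counterparties
            "age_blocks": age_blocks,
            "tx_per_block": n_tx / max(age_blocks, 1),
        }
    return features
```

Dividing `degree` by the number of other nodes in the graph would give normalized degree centrality.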
Performance Results
| Model | Macro-F1 Score |
|---|---|
| GBC | 0.57 |
| GraphSage | 0.61 |
| GAT | 0.64 |
| GIN | 0.63 |
GAT achieved the highest macro-F1 score (0.64), ahead of GIN (0.63), GraphSage (0.61), and the GBC baseline (0.57). This suggests that neighborhood context improves classification, especially for mining and betting entities. However, ransomware detection remained challenging due to low sample size and obfuscation tactics.
A confusion matrix revealed frequent misclassification into "individual," suggesting future work should refine distinguishing features for rare classes.
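For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class such as ransomware counts as much as a common one. The sketch below is a plain-Python version written for illustration; it is not the evaluation code used for the reported results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

A model that predicts "individual" for most nodes can reach high accuracy while scoring poorly on macro-F1, which is why the metric suits an imbalanced label set like this one.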
Use Cases and Research Opportunities
This dataset unlocks diverse research pathways:
1. Inter-Entity Flow Analysis
Study how value circulates between exchanges, miners, and illicit services—especially during market shocks or regulatory events.
2. Longitudinal Network Evolution
Track changes in connectivity, density, and centralization over time—revealing adoption trends and structural shifts.
3. Cross-Network Comparisons
Compare Bitcoin’s topology with traditional financial networks or other blockchains to uncover systemic properties.
4. Pretraining for Financial Graph AI
Leverage scale for pretraining GNNs on transaction behavior, later fine-tuning on smaller domains like banking fraud or supply chain finance.
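As one example of the longitudinal analysis mentioned above, the sketch below groups edges into fixed-width windows of block height and reports the directed graph density of each window. The window width, the function name, and the edge tuple layout are assumptions for illustration.

```python
from collections import defaultdict


def density_by_slice(edges, slice_blocks):
    """Directed density per time slice.

    edges        : iterable of (sender, recipient, block_height)
    slice_blocks : width of each time slice, in blocks
    Returns dict slice_index -> density, where density is the number of
    distinct directed pairs divided by n * (n - 1) for n active nodes.
    """
    nodes = defaultdict(set)
    pairs = defaultdict(set)
    for sender, recipient, block_height in edges:
        index = block_height // slice_blocks
        nodes[index].update((sender, recipient))
        if sender != recipient:
            pairs[index].add((sender, recipient))

    result = {}
    for index in sorted(nodes):
        n = len(nodes[index])
        result[index] = len(pairs[index]) / (n * (n - 1)) if n > 1 else 0.0
    return result
```

The same windowing can be reused for other per-slice statistics, such as the share of volume passing through labeled exchanges.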
Accessing the Dataset
The full dataset is publicly available and includes:
- Compressed PostgreSQL dump of node/edge features
- Labeled addresses (addresses.csv)
- Bitcointalk thread archives (JSON format)
- Labeling scripts and prompts
Database requirements:
- 40 GB for node_features
- 80 GB for transaction_edges (with indexes)
Recommended PostgreSQL settings:
shared_buffers = 1GB
work_mem = 16MB
maintenance_work_mem = 1GB
wal_buffers = 2MB
max_parallel_workers_per_gather = 4
Use pg_restore to load:
pg_restore -j 8 -Fd -U user -d db_name dataset
Frequently Asked Questions
Q: How does this dataset differ from Elliptic?
A: Unlike Elliptic’s small, binary-labeled graphs, this dataset covers nearly 13 years of activity and includes diverse entity types, temporal edges, and real-world labeling from multiple sources. This makes it suited to research beyond anti-money laundering.
Q: Can I use this for commercial applications?
A: Yes, the dataset is open for academic and commercial use. However, always comply with local regulations regarding blockchain data usage.
Q: Why exclude CoinJoin transactions?
A: CoinJoins intentionally obscure transaction trails. Including them would distort clustering and flow analysis, reducing reliability for most research purposes.
Q: How accurate are the labels?
A: Labels are derived from verified sources and contextual AI analysis. While highly reliable for common entities (exchanges, mining), rare categories like ransomware may have lower precision due to limited ground truth.
Q: Is there support for streaming or real-time updates?
A: The current release is static (up to block 700,000). Future versions may include incremental update mechanisms.
Q: Can I contribute new labels?
A: Yes—code repositories are open on GitHub. Contributions that improve labeling accuracy or coverage are welcome.
Final Thoughts
This transaction graph dataset marks a major step forward in Bitcoin research infrastructure. By combining scale, temporal fidelity, and rich labeling, it empowers researchers to move beyond basic address tracking toward deep economic and behavioral analysis.
Whether you're studying financial crime, network topology, or machine learning on graphs, this resource offers unparalleled depth and flexibility.