Bitcoin Research with a Transaction Graph Dataset


Bitcoin, proposed in 2008 and launched in 2009, introduced a decentralized system for storing and transferring value without reliance on central authorities. As interest in blockchain analytics grows, so does the need for comprehensive, accessible datasets to support advanced research. This article presents a large-scale, publicly available Bitcoin transaction graph dataset designed to help researchers, data scientists, and developers explore the network’s dynamics, entity behaviors, and financial flows.

Spanning nearly 13 years of blockchain activity, this dataset includes 252 million nodes and 785 million directed edges, representing real-world entities and their value transfers. With rich temporal attributes, node labels, and curated metadata, it stands as the most extensive public resource of its kind—enabling deep analysis beyond what previous datasets like Elliptic could offer.


Understanding the Bitcoin Ecosystem

Bitcoin operates on a transparent, immutable ledger known as the blockchain. Every transaction is publicly recorded, creating an open financial ecosystem where value moves peer-to-peer through cryptographic verification. While this transparency fosters trust and auditability, it also opens doors to sophisticated analytical methods—especially when structured into graph-based representations.

In this ecosystem:

Despite full data availability, extracting meaningful insights remains challenging due to the complexity of linking addresses to real-world actors and modeling interactions over time.



The Need for Advanced Transaction Graph Datasets

Existing Bitcoin datasets often focus on labeled address lists or small-scale graphs, placing the burden of preprocessing on researchers. For example:

These limitations hinder progress in areas like fraud detection, economic modeling, and behavioral analysis.

To bridge this gap, we introduce a large-scale temporal transaction graph that:

This dataset enables not only classification tasks but also longitudinal studies of network evolution and cross-sector interaction patterns.


Dataset Construction: From Blockchain to Graph

Raw Data Extraction

All transaction data was extracted from the first 700,000 blocks of the Bitcoin blockchain using a self-hosted Bitcoin Core node. The raw ledger was parsed to reconstruct every transaction, preserving timestamps (via block index), input/output structures, and script details.

This process yielded over 670 million transactions, forming the foundation of the graph.
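As a rough illustration of what "reconstructing every transaction" involves, the sketch below flattens one decoded block into output records and spend links. It assumes the block is in the JSON shape returned by Bitcoin Core's `getblock <hash> 2` RPC call; the source does not say whether the RPC interface or the raw block files were used, and the function names and placeholder txids are hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple


def iter_outputs(block: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one record per transaction output in a decoded block."""
    for tx in block["tx"]:
        for out in tx["vout"]:
            yield {
                "block_height": block["height"],  # serves as the timestamp
                "txid": tx["txid"],
                "n": out["n"],                    # position of the output
                "value_btc": out["value"],
                "script_hex": out["scriptPubKey"]["hex"],
            }


def iter_spends(block: Dict[str, Any]) -> Iterator[Tuple[str, str, int]]:
    """Yield (spending txid, previous txid, previous output index)."""
    for tx in block["tx"]:
        for vin in tx["vin"]:
            if "coinbase" in vin:  # newly minted coins spend no prior output
                continue
            yield tx["txid"], vin["txid"], vin["vout"]


# Toy block with placeholder txids and scripts (not real chain data).
toy_block = {
    "height": 170,
    "tx": [
        {"txid": "t1", "vin": [{"coinbase": "04ffff"}],
         "vout": [{"n": 0, "value": 50.0, "scriptPubKey": {"hex": "51"}}]},
        {"txid": "t2", "vin": [{"txid": "t0", "vout": 1}],
         "vout": [{"n": 0, "value": 10.0, "scriptPubKey": {"hex": "52"}},
                  {"n": 1, "value": 40.0, "scriptPubKey": {"hex": "53"}}]},
    ],
}

records = list(iter_outputs(toy_block))
spends = list(iter_spends(toy_block))
```

Joining the spend links back to the output records recovers the locking script and value behind each input, which is the information the clustering and edge steps need.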

Node Definition: Clustering Scripts into Real Entities

Treating individual addresses as nodes would split a single actor across many nodes and add noise, so we instead applied heuristics to cluster scripts likely controlled by the same entity.

Key steps:

  1. Identify UTXOs protected by identical or related locking scripts.
  2. Apply behavioral clustering rules based on co-spending patterns (e.g., multiple inputs from different addresses in one transaction).
  3. Assign each cluster a unique identifier (alias), representing a single economic actor.

This method reduced over 874 million scripts into approximately 252 million entity clusters, significantly improving analytical clarity.
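The co-spending rule in step 2 is commonly implemented with a union-find (disjoint-set) structure. The following is a minimal sketch under that assumption, not the authors' actual pipeline; `UnionFind` and `cluster_scripts` are hypothetical names, and each transaction is represented only by the list of scripts that funded its inputs.

```python
from typing import Dict, Hashable, Iterable, List


class UnionFind:
    """Disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited element at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_scripts(tx_inputs: Iterable[List[str]]) -> Dict[str, int]:
    """Map each script to an integer alias shared by its cluster."""
    uf = UnionFind()
    for scripts in tx_inputs:
        for script in scripts:
            uf.find(script)  # register scripts that are never co-spent
        for other in scripts[1:]:
            uf.union(scripts[0], other)  # co-spent scripts share an owner
    aliases: Dict[Hashable, int] = {}
    result: Dict[str, int] = {}
    for script in list(uf.parent):
        root = uf.find(script)
        result[script] = aliases.setdefault(root, len(aliases))
    return result


# Scripts a, b, c are linked transitively; d stands alone; e and f are linked.
clusters = cluster_scripts([["a", "b"], ["b", "c"], ["d"], ["e", "f"]])
```

Because unions are transitive, two scripts that never appear together still end up in the same cluster when a third script has been co-spent with both.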

Edge Formation: Modeling Value Transfer

An edge is created from sender to recipient when value is transferred between clusters in a transaction. The amount sent is calculated as:

Value transmitted = (Proportion of input contributed) × (Net output received)
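A small worked example may help make the formula concrete. The sketch below is one possible reading: it takes "net output received" to be the amount a recipient cluster receives in outputs, and it skips outputs that return to a sender cluster (treated as change). Both choices are assumptions, since the text does not spell them out, and `edge_values` is a hypothetical helper.

```python
from typing import Dict, Tuple


def edge_values(inputs: Dict[str, float],
                outputs: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Split each recipient's output among senders by input share.

    inputs:  cluster alias -> total BTC that cluster contributed as inputs
    outputs: cluster alias -> total BTC that cluster received as outputs
    """
    total_in = sum(inputs.values())
    if total_in == 0:
        return {}
    edges: Dict[Tuple[str, str], float] = {}
    for sender, amount_in in inputs.items():
        share = amount_in / total_in  # proportion of input contributed
        for recipient, amount_out in outputs.items():
            if recipient in inputs:   # assumption: output back to a sender is change
                continue
            edges[(sender, recipient)] = share * amount_out
    return edges


# Toy transaction: A and B fund it; C and D receive; 0.3 BTC returns to A.
edges = edge_values({"A": 3.0, "B": 1.0}, {"C": 2.0, "D": 1.6, "A": 0.3})
```

Here cluster A contributes 3 of the 4 BTC spent (75%), so of the 2 BTC that C receives, 1.5 BTC is attributed to the edge A → C and 0.5 BTC to B → C.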

Edges are:

This structure allows for dynamic network analysis across time slices and behavioral segmentation.
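One simple way to work with time slices is to bucket edges by block height and summarize each bucket. The sketch below assumes each edge is available as a `(source, target, block_height, value)` tuple; that layout, the window size, and the function name `slice_stats` are illustrative assumptions rather than the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str, int, float]  # (source, target, block_height, value)


def slice_stats(edges: Iterable[Edge],
                window: int = 100_000) -> Dict[int, Dict[str, float]]:
    """Summarize the graph per window of `window` consecutive blocks."""
    slices = defaultdict(lambda: {"nodes": set(), "edges": 0, "value": 0.0})
    for src, dst, height, value in edges:
        bucket = slices[height // window]  # slice index for this edge
        bucket["nodes"].update((src, dst))
        bucket["edges"] += 1
        bucket["value"] += value
    return {
        index: {"nodes": len(b["nodes"]), "edges": b["edges"], "value": b["value"]}
        for index, b in sorted(slices.items())
    }


toy_edges = [
    ("a", "b", 50_000, 1.0),
    ("b", "c", 60_000, 2.0),
    ("a", "c", 150_000, 0.5),
]
stats = slice_stats(toy_edges)
```

The per-slice counts can then be compared across windows to follow how the number of active entities and the transferred value change over time.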


Handling Special Transactions

Not all transactions reflect genuine economic exchange. Two key types were excluded to preserve data integrity:

CoinJoin Transactions

Designed for privacy, CoinJoins merge multiple users’ transactions, obscuring fund origins. They undermine clustering heuristics and complicate flow tracing. We used pattern-matching heuristics from prior research to detect and exclude them.
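The text does not specify which patterns were matched. As a purely illustrative sketch, a frequently cited signal is a transaction with several inputs and several outputs of exactly the same value. The thresholds and the function name below are assumptions, not the rule used to build the dataset.

```python
from collections import Counter
from typing import Sequence


def looks_like_coinjoin(n_inputs: int,
                        output_values: Sequence[int],
                        min_participants: int = 3) -> bool:
    """Flag transactions with many inputs and many equal-valued outputs.

    output_values are in satoshis so that equality checks are exact.
    """
    if n_inputs < min_participants or not output_values:
        return False
    # Size of the largest group of outputs sharing one value.
    largest_group = Counter(output_values).most_common(1)[0][1]
    return largest_group >= min_participants


mixed = looks_like_coinjoin(5, [10_000_000] * 5 + [123, 456])
payment = looks_like_coinjoin(1, [5_000, 7_000])
batch = looks_like_coinjoin(4, [1, 2, 3, 4])
```

A rule this simple will produce both false positives (for example, batched payouts of equal size) and false negatives, so published detectors typically combine several such signals.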

Colored Coin Transactions

These embed non-Bitcoin assets (e.g., tokens or real-world assets) within scripts. Detected via known protocols (Open Asset, Omni Layer), they were removed to maintain focus on native BTC flows.
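Protocol-based detection can be approximated by inspecting OP_RETURN payloads for known markers. The sketch below assumes the Open Assets marker bytes `4f41` followed by version `0100`, and the ASCII prefix `omni` used by newer Omni Layer transactions; these constants reflect our reading of the public protocol specifications and should be verified before use. Older Omni encodings that do not rely on OP_RETURN are not covered, and the function names are hypothetical.

```python
from typing import Optional

OP_RETURN = 0x6A
OP_PUSHDATA1 = 0x4C

# Assumed payload prefixes; verify against the protocol specifications.
MARKERS = {
    "open_assets": bytes.fromhex("4f410100"),
    "omni": b"omni",
}


def op_return_payload(script_hex: str) -> Optional[bytes]:
    """Return the first data push of an OP_RETURN script, else None."""
    script = bytes.fromhex(script_hex)
    if len(script) < 2 or script[0] != OP_RETURN:
        return None
    if script[1] <= 75:  # direct push: the opcode itself is the length
        return script[2:2 + script[1]]
    if script[1] == OP_PUSHDATA1 and len(script) >= 3:
        return script[3:3 + script[2]]
    return None


def colored_coin_protocol(script_hex: str) -> Optional[str]:
    """Name the matching protocol, or None if no marker is found."""
    payload = op_return_payload(script_hex)
    if payload is None:
        return None
    for name, marker in MARKERS.items():
        if payload.startswith(marker):
            return name
    return None


omni_script = "6a14" + "6f6d6e69" + "00" * 16  # 20-byte payload
oa_script = "6a06" + "4f4101000100"            # 6-byte payload
p2pkh_script = "76a914" + "00" * 20 + "88ac"   # ordinary payment script
```

A transaction would be dropped if any of its output scripts matches a marker.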


Node Labeling: Bridging On-Chain and Off-Chain Data

Accurate labeling is crucial for supervised learning. We combined multiple sources to assign entity types to clusters:

Primary Source: Bitcointalk Forum Analysis

We collected 14 million posts from Bitcointalk, the largest Bitcoin community forum, to extract contextual references to addresses. Using OpenAI's gpt-4o-mini model via API calls, we analyzed message content alongside transaction IDs and USD amounts (converted using historical BTC/USD rates) to infer entity identities.

Examples include:

Supplementary Label Sources

To expand coverage:

These efforts produced a labeled set of 101,186 addresses, mapped to script clusters and ultimately to graph nodes.

Entity Types Identified



Technical Validation: Predicting Entity Types with GNNs

To validate dataset quality, we trained several models to predict node labels using both structural and feature-based data.

Models Evaluated

All models used node features like:

Performance Results

Model       Macro-F1 Score
GBC         0.57
GraphSAGE   0.61
GAT         0.64
GIN         0.63

GAT achieved the highest macro-F1 score (0.64), and all three GNNs outperformed GBC (0.57), which suggests that neighborhood context improves classification, especially for mining and betting entities. Ransomware detection remained challenging because of the small sample size and obfuscation tactics.

A confusion matrix revealed frequent misclassification into "individual," suggesting future work should refine distinguishing features for rare classes.
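Macro-F1 averages the per-class F1 scores with equal weight, which is why weak performance on a rare class lowers the overall score even when most nodes are classified correctly. The sketch below computes it from a confusion matrix; the three-class matrix is invented for illustration and does not come from the dataset or the reported experiments.

```python
from typing import Dict, List, Sequence, Tuple


def macro_f1(confusion: Sequence[Sequence[int]],
             labels: List[str]) -> Tuple[float, Dict[str, float]]:
    """confusion[i][j] counts nodes of true class i predicted as class j."""
    n = len(labels)
    per_class: Dict[str, float] = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # column minus diagonal
        fn = sum(confusion[k]) - tp                       # row minus diagonal
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    return sum(per_class.values()) / n, per_class


# Invented toy matrix: most errors fall into the "individual" column.
labels = ["individual", "exchange", "ransomware"]
confusion = [
    [90, 5, 5],   # true individual
    [10, 30, 0],  # true exchange
    [8, 0, 2],    # true ransomware
]
score, per_class = macro_f1(confusion, labels)
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
```

In this toy case the accuracy is about 0.81, yet the macro-F1 is only about 0.63, because the ransomware class scores roughly 0.24 after most of its nodes are predicted as "individual".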


Use Cases and Research Opportunities

This dataset unlocks diverse research pathways:

1. Inter-Entity Flow Analysis

Study how value circulates between exchanges, miners, and illicit services—especially during market shocks or regulatory events.

2. Longitudinal Network Evolution

Track changes in connectivity, density, and centralization over time—revealing adoption trends and structural shifts.

3. Cross-Network Comparisons

Compare Bitcoin’s topology with traditional financial networks or other blockchains to uncover systemic properties.

4. Pretraining for Financial Graph AI

Leverage scale for pretraining GNNs on transaction behavior, later fine-tuning on smaller domains like banking fraud or supply chain finance.


Accessing the Dataset

The full dataset is publicly available and includes:

Database requirements:

Recommended PostgreSQL settings:

shared_buffers = 1GB
work_mem = 16MB
maintenance_work_mem = 1GB
wal_buffers = 2MB
max_parallel_workers_per_gather = 4

Use pg_restore to load:

pg_restore -j 8 -Fd -U user -d db_name dataset

Frequently Asked Questions

Q: How does this dataset differ from Elliptic?
A: Unlike Elliptic’s small, binary-labeled graphs, this dataset covers nearly 13 years of activity and includes diverse entity types, temporal edges, and labels drawn from multiple sources, which makes it suitable for research beyond anti-money laundering.

Q: Can I use this for commercial applications?
A: Yes, the dataset is open for academic and commercial use. However, always comply with local regulations regarding blockchain data usage.

Q: Why exclude CoinJoin transactions?
A: CoinJoins intentionally obscure transaction trails. Including them would distort clustering and flow analysis, reducing reliability for most research purposes.

Q: How accurate are the labels?
A: Labels are derived from verified sources and contextual AI analysis. While highly reliable for common entities (exchanges, mining), rare categories like ransomware may have lower precision due to limited ground truth.

Q: Is there support for streaming or real-time updates?
A: The current release is static (up to block 700,000). Future versions may include incremental update mechanisms.

Q: Can I contribute new labels?
A: Yes—code repositories are open on GitHub. Contributions that improve labeling accuracy or coverage are welcome.


Final Thoughts

This transaction graph dataset marks a major step forward in Bitcoin research infrastructure. By combining scale, temporal fidelity, and rich labeling, it empowers researchers to move beyond basic address tracking toward deep economic and behavioral analysis.

Whether you're studying financial crime, network topology, or machine learning on graphs, this resource offers unparalleled depth and flexibility.
