Week 2: What Data Structure Are Blockchains?
April 7, 2023
For the second week of my senior project, I continued implementing a blockchain node using Python. In this blog, I will cover the decisions I made while coding. You can view the evolution of my code here.
Firstly I had to consider what data structure would contain the list of chronologically linked blocks. A programming data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. So for my project, I had to determine what structure would best hold a sequence of blocks in which order matters and allows for easy access to the most recent blocks in the sequence.
My first guess was a linked list, a common coding data structure consisting of a series of nodes that contain both data, which in this case are the blocks, and a reference that indicates which node is next in the list. What makes me think this might not be the best approach is that they are difficult to traverse (search through each) and find the most recent node in the list which will be required frequently because each block in the chain must contain the hash of the previous block. Additionally, linked lists are generally best suited for inserting and deleting nodes in a list, but because blockchains are immutable there is no need to ever insert in the middle of a list or delete elements from it. I figured the next best option would be to use an array. A data structure that orders elements (each piece of data in the list, in this case, blocks) and then provides indexes that allow you to access the element at any index. Arrays work well here because they allow for easy access to the length of the chain, are easy to iterate through, and are the structure I am most familiar with. I originally began coding the chain using an array to store blocks but then I realized I needed to be able to allow for more than one singular chain, instead the code must allow for multiple simultaneous chains. It is important for blockchains to be able to fork, or branch off of each other and allow multiple chains to be available simultaneously. Then once one chain becomes significantly longer than the others it will use the longest chain as the primary one and effectively delete the alternatives. Using the longest chain actually secures the chain because it is mathematically improbable for malicious actors on the blockchain to lie about enough information that their fake branch of the chain would become the longest. So to allow for this type of forking I figured the best data structure would be a combination of a dictionary and arrays. A dictionary is built of key-value pairs in which referencing a key will return the value of that associated pair. I set the keys of these dictionaries as integer values, which represent the chain ID number, and the values associated with those keys were an array of blocks for that chain ID. These structures are quite complex so I recommend exploring this site to learn more.