High-level Data Structures internals
January 29, 2025
823 words
Data structures are probably the most looked-up topic in computer science. This is my high-level understanding of them from a theoretical standpoint, written while I was sitting in my computer architecture lecture. We are taught tens of data structures, but almost all the widely used ones are essentially built over two primitive data structures: arrays and objects. This is a high-level overview, and implementations differ for more advanced data structures and across languages. The internals of a stack or a queue are often just arrays with some explicit rules of operation. For example, a stack can be backed by an array like:
arr = [1, 2, 3]
arr[1]
is absolutely workable in code, but the stack abstraction only allows viewing the top element (for instance, peek() would return the last pushed item). Note that stacks and queues can also be implemented using objects (linked nodes) instead of arrays, but we'll stick with arrays here for simplicity.
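A minimal sketch of that rule-enforcing wrapper in Python; the class and its method set are illustrative, not a fixed API:

class Stack:
    def __init__(self):
        self._items = []             # the backing array

    def push(self, value):
        self._items.append(value)    # only ever add at the top

    def pop(self):
        if not self._items:
            raise IndexError("stack underflow")
        return self._items.pop()     # only ever remove from the top

    def peek(self):
        return self._items[-1]       # view the last pushed item

The array can still do anything an array does; the wrapper just refuses to expose anything beyond the stack's rules.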
Let's look at some more complex data structures like trees, graphs, and linked lists. All of these are just objects (nodes, in data-structures terms) that hold values and the addresses of other connected objects. A specific arrangement of these objects results in a linked list (linear connections), a graph (non-linear connections), a tree (non-linear connections with no cycles), and so on. Data structures built from objects at their core are traversed through those links, and accessing an element from the middle isn't possible (unlike arrays) unless you explicitly hold a pointer to that particular object. The starting point of such a structure, generally called the root or head, is crucial to preserve so we don't lose track of the whole structure in the ocean of memory.
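In Python, such an object can be as small as this (a singly linked node; a tree or graph node would hold several such references instead of one next):

class Node:
    def __init__(self, value):
        self.value = value   # the data this node holds
        self.next = None     # address of the next node; None marks the end

# The head is the handle we must preserve to reach everything else.
head = Node(1)
head.next = Node(2)
head.next.next = Node(3)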
While iterating over any data structure, we need to identify where its boundary lies. In linear data structures based on arrays, an out-of-bounds index, or (in stack terms) an underflow or overflow, indicates we've moved outside the allocated space. In linear data structures implemented with objects, we usually take an iterative approach whose termination condition is a null next pointer: that marks the end of the linked list. In non-linear data structures, a recursive approach is often preferred (repeating the same steps until a termination condition is met), the base case usually being that the current object is null. This makes the code shorter, but it can be harder to follow once recursions nest.
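Both boundary checks, sketched in Python; traverse_list uses the singly linked Node from above, while traverse_tree assumes a hypothetical binary-tree node with left and right pointers:

def traverse_list(head):
    node = head
    while node is not None:      # a null next pointer marks the end
        print(node.value)
        node = node.next

def traverse_tree(node):
    if node is None:             # base case: fell off the tree
        return
    print(node.value)
    traverse_tree(node.left)     # visit the left subtree
    traverse_tree(node.right)    # visit the right subtree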
The key property of an array is that it’s stored in contiguous memory, where reading any index is just basic arithmetic on the array’s starting address. That means you can overwrite array values in-place, but changing the memory address of a particular item isn’t how arrays usually work:
Address of Arr[Index] = B + W * (Index - LB)
- Index = The index of the element (not its value).
- B = Base address of the array.
- W = Storage size of one element in bytes (e.g., 4 for an int).
- LB = Lower bound of the index (if not specified, assume zero).
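A quick worked example with assumed numbers: a hypothetical int array at base address 1000 with 4-byte elements and a lower bound of 0:

B, W, LB = 1000, 4, 0            # assumed base address, element size, lower bound
index = 3
address = B + W * (index - LB)   # 1000 + 4 * (3 - 0)
print(address)                   # 1012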
In data structures implemented with objects, changing pointers is often more suitable than overwriting values when you need to rearrange the structure. By changing the addresses an object points to, you effectively rearrange the data structure. This suits data whose arrangement changes often but whose values rarely do. Using immutable data types like tuples or strings inside nodes can help maintain data integrity and avoid accidental edits; the data itself typically isn't changed unless explicitly needed. A simple example is reversing a data structure. In an array, you'd overwrite values in reverse order at the same addresses. In an object-based structure like a linked list, you'd just change the next pointer in each node: the node values stay the same, but they get rearranged like Lego blocks.

Some more complex data structures, like hashmaps, are a mix of both arrays and objects, which makes the internal implementation more involved but the usage more efficient: CRUD operations run in O(1) time on average. A hashmap is often implemented internally as an array where each entry is the head of a linked list, called a bucket in hashmap terms. Without going into too much detail, the basic idea is that the array acts like an index: a hashing algorithm maps each key to a slot, and we then iterate over the linked list in that slot to find the specific key or value. Standard hashing algorithms are designed to distribute keys evenly across the array, reducing collisions and keeping buckets balanced so lookups stay efficient.

A lot has been written about data structures from a broader point of view, and detailed articles are everywhere; all in all, looking at data structures like this has deepened my understanding of problems involving them and how I operate with them. To wrap up, here are minimal sketches of the two ideas above: reversing by rewiring pointers, and a hashmap built from an array of buckets.
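First, reversing. The array version overwrites values in place, while the linked-list version only rewires next pointers; a sketch using the Node shape from earlier:

def reverse_array(arr):
    left, right = 0, len(arr) - 1
    while left < right:
        arr[left], arr[right] = arr[right], arr[left]   # overwrite in place
        left += 1
        right -= 1

def reverse_list(head):
    prev = None
    while head is not None:
        nxt = head.next          # remember the rest of the chain
        head.next = prev         # rewire the pointer; the value is untouched
        prev = head
        head = nxt
    return prev                  # the old tail is the new head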
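And a toy hashmap, assuming the array-of-buckets layout described above. Plain Python lists stand in for the per-bucket linked lists here, and real implementations add resizing and stronger hash functions:

class HashMap:
    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]    # the array acts as the index

    def _slot(self, key):
        return hash(key) % len(self.buckets)        # hashing picks the bucket

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for pair in bucket:                         # walk the bucket
            if pair[0] == key:
                pair[1] = value                     # update an existing key
                return
        bucket.append([key, value])                 # or add a new pair

    def get(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None                                 # key absent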