#41: Arrays vs. Linked Lists
Welcome back to The Coder Cafe! As the Advent of Code has just begun, we will focus on algorithms and data structures this week. Let’s start today by discussing arrays vs. linked lists.
An array is a collection of elements stored contiguously in memory, meaning the elements are stored next to one another. There are two types of arrays to consider:
Fixed-size arrays: Their size is set at creation. Adding an element requires allocating a new, larger array in memory and copying over all the existing elements.
Dynamic arrays: Many programming languages offer a more flexible alternative to fixed-size arrays, for example `ArrayList` in Java, `std::vector` in C++, or slices in Go. Dynamic arrays typically allocate extra space to avoid resizing on every insertion.
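The resizing behavior can be sketched as follows. This is a minimal, illustrative Python implementation, not any library's actual code; the class name and the initial capacity of 4 are arbitrary choices:

```python
# A minimal sketch of a dynamic array: a fixed-size backing buffer that
# doubles its capacity when it runs out of room, in the spirit of
# ArrayList or std::vector.

class DynamicArray:
    def __init__(self):
        self._capacity = 4          # initial backing-buffer size (arbitrary)
        self._size = 0              # number of elements actually stored
        self._buffer = [None] * self._capacity

    def append(self, value):
        if self._size == self._capacity:
            self._grow()            # O(n) resize, but it happens rarely
        self._buffer[self._size] = value
        self._size += 1             # O(1) in the common case

    def _grow(self):
        self._capacity *= 2         # doubling keeps appends O(1) amortized
        new_buffer = [None] * self._capacity
        for i in range(self._size): # copy all existing elements over
            new_buffer[i] = self._buffer[i]
        self._buffer = new_buffer

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        return self._buffer[index]

    def __len__(self):
        return self._size

arr = DynamicArray()
for i in range(10):
    arr.append(i)
print(len(arr), arr._capacity)  # 10 elements fit in a capacity-16 buffer
```

Only two of the ten appends triggered a copy (at sizes 4 and 8), which is why the cost averages out to constant time.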
A linked list consists of nodes where each node contains two things: some data and a pointer (also called reference) to the next node in the list. It’s important to note that linked lists have a memory overhead compared to arrays because each node needs extra space for the pointer. A specialized version of the linked list is the doubly linked list, which has pointers to both the previous and the next node.
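A node can be sketched in a few lines of Python (the `Node` class and its field names are illustrative):

```python
# A minimal sketch of a singly linked list node: each node stores a value
# plus a pointer (reference) to the next node, or None at the end.

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node  # the per-node pointer overhead mentioned above

# Build the list 1 -> 2 -> 3 by chaining nodes.
head = Node(1, Node(2, Node(3)))

# Traversal: follow the pointers until we fall off the end.
values = []
node = head
while node is not None:
    values.append(node.value)
    node = node.next
print(values)  # [1, 2, 3]
```

A doubly linked list would simply add a second field, such as `prev`, pointing back at the previous node.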
The main differences:
Memory layout: Arrays are stored contiguously in memory. With linked lists, there’s no guarantee.
Access time: We can access any element of an array in constant time using its index. With linked lists, we have to start from the first node and follow the pointers one by one.
Flexibility: Fixed-size arrays cannot change size after creation. Dynamic arrays and linked lists, in contrast, can grow and shrink as elements are added and removed.
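The access-time difference can be illustrated in Python (a plain list stands in for an array, and the `Node` class is a hypothetical minimal linked list node):

```python
# Contrasting access patterns: direct indexing vs. pointer chasing.

class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

# Array: index arithmetic gives direct access, O(1).
array = [10, 20, 30, 40]
assert array[2] == 30

# Linked list: reaching index 2 means following 2 pointers, O(n).
head = Node(10, Node(20, Node(30, Node(40))))

def nth(head, index):
    node = head
    for _ in range(index):  # one pointer hop per step
        node = node.next
    return node.value

assert nth(head, 2) == 30
```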
In terms of time complexity (where n is the number of elements):
Adding an element:
- Fixed-size arrays: O(n), since we must copy all the elements into a new array.
- Dynamic arrays: O(n) in the worst case, when resizing is required, but O(1) on average[1] thanks to the extra space allocated up front.
- Linked lists: O(1) if we already hold a reference to the insertion point; otherwise, O(n), since we have to find the position first.
Deleting an element: similar time complexities to adding an element.
Updating an element:
- Fixed-size or dynamic arrays: O(1), since we can access the element directly through its index.
- Linked lists: O(1) if we already hold a reference to the node; otherwise, O(n) to find it.
Finding an element: O(n) for both arrays and linked lists if the elements are not sorted. If they are sorted, only arrays can achieve O(log n), using binary search, because binary search requires constant-time access to arbitrary indexes.
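A minimal sketch of binary search in Python, to make this concrete; each probe jumps straight to the middle index, something a linked list cannot do:

```python
# Binary search over a sorted array: O(log n) because any index can be
# read directly. On a linked list, reaching the midpoint would itself
# cost O(n) pointer hops, defeating the purpose.

def binary_search(sorted_array, target):
    low, high = 0, len(sorted_array) - 1
    while low <= high:
        mid = (low + high) // 2      # O(1) random access makes this possible
        if sorted_array[mid] == target:
            return mid
        if sorted_array[mid] < target:
            low = mid + 1            # discard the lower half
        else:
            high = mid - 1           # discard the upper half
    return -1                        # not found

data = [2, 3, 5, 7, 11, 13]
print(binary_search(data, 11))  # 4
print(binary_search(data, 4))   # -1
```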
Despite their theoretical advantages in time complexity for certain operations, linked lists can be slower than arrays in real-world scenarios. For example, why is iterating over all the elements of a linked list slower than iterating over an array?
The first reason is a lack of spatial locality. As we said, there's no guarantee that the nodes of a linked list are allocated contiguously. With an array, when the CPU fetches the cache line containing one element, the neighboring elements usually come along in the same line; with a linked list, the next node may live anywhere in memory, leading to more cache misses.
NOTE: Not to mention that because of the potential lack of spatial locality, applications that iterate over linked lists can't take advantage of SIMD[2] instructions, for example, to sum all the elements in the list.
Yet, in practice, if a linked list is allocated once, by the same thread, at the same time, it's pretty likely that all the nodes will end up contiguous in memory[3]. I once designed an experiment to benchmark the difference between:
Iterating over all the elements of an array.
Iterating over all the elements of a linked list allocated contiguously.
As a result, iterating over the contiguously allocated linked list was still 230% slower than iterating over the array. What are the reasons?
First, the assembly code generated for the linked list iteration requires more CPU cycles per iteration[4].
Second, the iteration pattern is not predictable, preventing the CPU from prefetching data effectively[5], which results in more cache misses and lower performance.
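For illustration, here is a rough sketch of this kind of benchmark in Python. It is not the original experiment, and the absolute numbers will vary wildly by language and runtime; in CPython, interpreter overhead dwarfs cache effects, so treat it only as the shape of the methodology:

```python
# Sum the same values stored in an array (Python list) and in a chain of
# nodes, timing each traversal. The list is built in one pass so the nodes
# are at least allocated close together in time.

import time

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

N = 200_000
array = list(range(N))

# Build the linked list front to back.
head = Node(0)
tail = head
for i in range(1, N):
    tail.next = Node(i)
    tail = tail.next

start = time.perf_counter()
array_sum = sum(array)
array_time = time.perf_counter() - start

start = time.perf_counter()
list_sum = 0
node = head
while node is not None:   # one pointer dereference per element
    list_sum += node.value
    node = node.next
list_time = time.perf_counter() - start

# Both traversals must agree on the result.
assert array_sum == list_sum == N * (N - 1) // 2
print(f"array: {array_time:.4f}s, linked list: {list_time:.4f}s")
```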
When choosing between arrays and linked lists, it's crucial to keep their limitations in mind. While linked lists offer better time complexity for certain operations, they can lead to lower performance in practice due to factors like cache misses and memory access patterns.
As a rule of thumb, consider using a linked list if your primary operations involve frequent insertions or deletions that are not at the end of the list[6]. Finally, if performance is critical for your application, always take the time to benchmark your assumptions and choose the data structure that best meets your needs.
Tomorrow, we will discuss the binary heap data structure.
1. Also called amortized time.
2. We will introduce SIMD in a future issue. For now, let's just state that SIMD (Single Instruction, Multiple Data) is a technique that allows a CPU to perform the same operation on multiple pieces of data simultaneously.
3. Note that due to memory fragmentation over time, nodes may eventually become scattered even if allocated contiguously at first.
5. Prefetching is an optimization that anticipates which data the CPU will need in the near future and loads it into the cache before the application actually requests it.
6. Dynamic arrays are often a better choice for operations at the end of the list, as they provide an amortized O(1) time complexity for these operations.