📅 Last updated: March 10, 2025
Hello! Let’s keep exploring CPU concepts by discussing Simultaneous Multithreading. Note that if you’re not familiar with the concept of pipelining, I would recommend you read Instruction Pipelining first since it will be referenced multiple times throughout this post. Grab a coffee (or a tea), and let’s get started!
Providing the best performance has always been a major goal for CPU vendors. In this never-ending race for supremacy, one technology has emerged as a game-changer: Simultaneous Multithreading (SMT). In this post, we will start by understanding what SMT is and how it works, discuss performance and scheduling considerations, and conclude with thoughts on the future of SMT.
Understanding SMT
Before giving a definition that many would find too abstract, let’s first visualize the difference between a CPU without SMT and one with SMT.
First, let’s consider an example of a CPU without SMT:
This CPU consists of four physical cores, each equipped with one hardware thread (T1 stands for Thread 1). In the context of CPUs, a thread is the smallest execution unit within a core, responsible for fetching, decoding, and executing instructions. Each core also has its own L1 and L2 caches, while the L3 cache is shared across all the cores.
NOTE: L1, L2, and L3 are types of CPU caches designed for quick access to frequently accessed data and instructions. In general, the closer a cache is to the thread, the lower the latency: L1 is the smallest and fastest while L3 is the largest and slowest.
Now, let’s consider a CPU equipped with SMT, like the Intel Core i7-7700K:
Unlike a single-thread-per-core design, each core of the i7-7700K is composed of two separate threads, a feature enabled by SMT. SMT enhances CPU efficiency by allowing each physical core to process multiple threads concurrently (typically two) to maximize resource utilization and improve overall performance.
In the context of SMT, threads are also referred to as virtual CPUs. But why are they called virtual? Because they are logical constructs that abstract physical core resources. This abstraction allows the operating system (OS) to treat each thread as an independent processing unit, making eight threads visible to the OS instead of just four.
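You can observe this abstraction from software. For example, using only Python's standard library (the reported numbers depend entirely on your machine):

```python
import os

# Number of logical CPUs (hardware threads) the OS can schedule on.
# On a 4-core CPU with 2-way SMT, such as the i7-7700K, this reports 8.
logical_cpus = os.cpu_count()
print(logical_cpus)

# On Linux, the set of logical CPUs the current process may run on:
if hasattr(os, "sched_getaffinity"):
    print(sorted(os.sched_getaffinity(0)))
```

Note that `os.cpu_count()` reports logical CPUs; the OS sees each SMT thread as a full processing unit and cannot tell, from this number alone, how many physical cores exist.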
Origins of SMT
While you may not be familiar with SMT, perhaps you have already heard of Intel’s first implementation, called Hyper-Threading. Indeed, Intel was one of the first to implement SMT in 2002 with the Pentium 4 CPU.
To understand why Hyper-Threading was developed, it’s crucial to understand the historical context:
Back then, clock frequency was a primary selling point. For instance, the Pentium 4 was composed of a single core with a maximum clock frequency of 3.8 GHz [1].
Higher clock speed often correlates with longer pipelines.
Yet, the drawback of long pipelines is an increase in branch misprediction penalties, as a significant number of clock cycles are required to refill the pipeline (e.g., a 20-stage pipeline requires at least 20 cycles to refill).
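To get a feel for this cost, here is a back-of-the-envelope model of the refill penalty (the figures in the example are illustrative, not measured):

```python
def misprediction_overhead(stages: int, mispredict_rate: float,
                           branch_fraction: float) -> float:
    """Average extra cycles per instruction lost to pipeline refills.

    stages: pipeline depth, used as a rough approximation of the refill
            cost in cycles after a branch misprediction
    mispredict_rate: fraction of branches that are predicted incorrectly
    branch_fraction: fraction of instructions that are branches
    """
    return stages * mispredict_rate * branch_fraction

# A 20-stage pipeline, 5% misprediction rate, 1 branch per 5 instructions:
print(misprediction_overhead(20, 0.05, 0.20))  # 0.2 extra cycles/instruction
```

The model makes the key point visible: doubling the pipeline depth doubles the penalty term, which is exactly why long-pipeline designs needed a way to do useful work during those wasted cycles.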
To address this, Intel integrated a second hardware thread per core that could keep making progress while the other thread was stalled.
By allowing multiple threads to share a core’s resources, Intel paved the way for modern CPUs, shaping the development of multi-core and multi-threaded architectures that would define processors for decades to come.
How Does SMT Work?
We should clarify that SMT doesn’t involve duplicating an entire physical core for each hardware thread. Let’s explore the inner details.
An OS is responsible for scheduling software threads across the available hardware threads. When multiple software threads share a hardware thread, the OS ensures fair CPU access. Switching from one thread to another is called a context switch.
NOTE: We already discussed fairness in Safety and Liveness.
Context switching between software threads isn’t a free operation, as it requires saving and restoring a collection of information such as the CPU registers and the program counter. Hyper-Threading alleviates some of this overhead, allowing two threads to alternate execution without a full context switch.
The collection of information saved and restored during a context switch is referred to as the architectural state. CPUs without SMT have only one architectural state per core, whereas CPUs with SMT have two architectural states per core.
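As a toy illustration of what "two architectural states per core" buys you (a deliberately simplified model; a real architectural state includes far more, such as control and flag registers):

```python
from dataclasses import dataclass, field

@dataclass
class ArchState:
    """A heavily simplified architectural state: the minimum a context
    switch must save and restore for a thread to resume correctly."""
    program_counter: int = 0
    registers: list = field(default_factory=lambda: [0] * 16)

# A core without SMT holds one architectural state: running a second
# thread requires saving this state and restoring another (a context
# switch). A 2-way SMT core holds two states, so two threads stay
# resident and no save/restore is needed to alternate between them.
core_without_smt = [ArchState()]
core_with_smt = [ArchState(), ArchState()]
print(len(core_without_smt), len(core_with_smt))
```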
In Instruction Pipelining, we mentioned [2] that the Execution Unit (EU) was responsible for executing instructions. Actually, a physical core is composed of multiple execution units, including:
Arithmetic Logic Units (ALUs) for integer arithmetic
Floating-Point Units (FPUs) for floating-point calculations
Load/Store Units (LSUs) for handling memory operations
With SMT, hardware threads maintain separate instruction queues fed by the Decode Unit (DU). Between the execution units and instruction queues lies the instruction scheduler in charge of selecting ready instructions from these queues. The instruction scheduler operates as follows:
If one thread is stalled, it switches to the instruction queue of the other thread
If both threads are active, it alternates between them each cycle
Because each thread has its own architectural state, as we mentioned above, execution units always have the necessary context to process instructions without requiring a context switch.
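The two rules above can be sketched as a toy simulation. This is an assumption-laden model for intuition only (real schedulers issue multiple instructions per cycle and track many more conditions):

```python
from collections import deque

def smt_schedule(queue1, queue2, stalled=()):
    """Toy SMT instruction scheduler: alternate between the two threads'
    instruction queues each cycle, and skip a thread while it is stalled
    (e.g., waiting on memory). Returns instructions in issue order."""
    queues = [deque(queue1), deque(queue2)]
    issued = []
    turn = 0
    while queues[0] or queues[1]:
        for _ in range(2):  # try the current thread, then the other
            t = turn % 2
            turn += 1
            if queues[t] and t not in stalled:
                issued.append(queues[t].popleft())
                break
        else:
            break  # both threads stalled or empty: nothing issues
    return issued

# Both threads active: the scheduler alternates each cycle.
print(smt_schedule(["a1", "a2"], ["b1", "b2"]))       # ['a1', 'b1', 'a2', 'b2']
# Thread 2 stalled: thread 1 fills the cycles that would otherwise idle.
print(smt_schedule(["a1", "a2"], ["b1", "b2"], {1}))  # ['a1', 'a2']
```

The second call shows the whole point of SMT: the stall of one thread doesn't leave the execution units idle.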
Here’s a visual representation of all the elements discussed in this section:
As we can see, the instruction scheduler selects instructions from either thread 1 or thread 2 and routes them to the appropriate execution units, which execute them without context-switching overhead, as each thread's context is already maintained in the corresponding architectural state.
In summary, SMT enables a CPU to run multiple threads per core without the overhead of frequent context switches. However, SMT doesn’t provide true parallel execution of threads in the same way that is possible with multiple physical cores.
Performance
While it depends on many criteria, such as the CPU architecture or the workload type, Intel usually cites a maximum of 30% performance gain with two threads on one physical core compared to a single thread. Yet, the trade-off is a roughly 20% increase in power consumption.
Let’s consider a workload running on an i7-7700K on a single thread. If we parallelize it across the eight available threads (assuming theoretically optimal parallelization), it won’t run at 800% efficiency. Instead, we can expect at most a theoretical 520% efficiency (130% × 4). It’s worth keeping in mind that you should not expect perfect linear scaling.
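That arithmetic generalizes to a simple upper-bound formula (both inputs are optimistic simplifications: perfect scaling across cores, plus Intel's cited ~30% gain from the second SMT thread):

```python
def smt_theoretical_throughput(cores: int, smt_gain: float) -> float:
    """Upper bound on throughput relative to a single thread, assuming
    perfect scaling across `cores` physical cores, each boosted by
    `smt_gain` from its second SMT thread."""
    return cores * (1 + smt_gain)

# i7-7700K: 4 cores, 8 visible threads, but at best:
print(smt_theoretical_throughput(4, 0.30))  # 5.2, i.e. 520%, not 800%
```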
As always, it’s best to benchmark your application rather than relying on theoretical gains. Indeed, enabling SMT won’t necessarily improve performance. For example, while this report states that CPU-bound workloads benefit from SMT, some memory-bound workloads perform worse when SMT is enabled.
Linux Kernel Scheduling
Consider two threads, whether they belong to the same process or not. When SMT is enabled, do you think the Linux kernel will favor:
(A) Scheduling these two threads across different physical cores whenever possible?
Or (B) Scheduling these two threads to the same physical core?
The answer is A. Indeed, the kernel has a strong bias towards spreading threads across different physical cores whenever possible. This is primarily because a thread running alone on a physical core doesn’t have to share execution units or caches with a sibling thread, which maximizes per-thread performance.
It’s also worth noting core scheduling, a Linux feature that lets you define groups of related threads that are allowed to share a core. For example, if two threads frequently access the same data, assigning them to the same core can sometimes be beneficial: it helps reduce cache misses and mitigate performance penalties such as false sharing, where cache lines are inefficiently bounced between cores [3].
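On Linux, you can experiment with thread placement yourself using CPU affinity, a simpler mechanism than core scheduling. A caveat: the numbering of logical CPUs is machine-specific, so which pairs are SMT siblings is an assumption here — check `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` on your system:

```python
import os

# The set of logical CPUs the current process is allowed to run on
# (0 means "the calling process").
allowed = os.sched_getaffinity(0)
print(sorted(allowed))

# Pinning the process to logical CPUs 0 and 4: on many Intel systems
# these are the two SMT siblings of physical core 0, but that pairing
# is an assumption — verify your machine's topology first.
# os.sched_setaffinity(0, {0, 4})  # uncomment to actually pin
```

Pinning both threads of a communicating pair onto the same physical core is a quick way to benchmark whether shared caches help or whether competing for the core's execution units hurts.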
The End of the SMT Era?
It’s been a while since clock frequency alone was the main selling point of CPUs. Today, with factors like energy efficiency and data center costs playing a bigger role, the focus has shifted towards performance-per-watt [4] rather than just raw speed.
As mentioned earlier, the main drawback of Hyper-Threading, according to Intel, is a 20% increase in power consumption. If we look at modern CPUs such as Intel’s 12th generation Alder Lake, we can see how this shift is shaping processor design, specifically with the introduction of two distinct core types:
P-cores (Performance cores) have Hyper-Threading (2 threads per core) because they are designed for high-performance tasks
E-cores (Efficiency cores) do not have Hyper-Threading because they prioritize performance-per-watt
NOTE: As discussed above, higher clock frequency correlates with speed improvements, longer pipelines, and increased power consumption. This is why P-cores operate at higher frequencies, feature longer pipelines, and require more power.
One of Intel’s latest releases, Lunar Lake, takes this trend even further—completely removing Hyper-Threading in favor of more cores and architectural improvements. Intel even claims that Lunar Lake is 20% faster than the previous generation of CPUs with Hyper-Threading. This suggests that Intel’s philosophy has shifted towards scaling core count (especially E-cores) and architectural improvements, rather than relying on Hyper-Threading to boost performance.
Final Thoughts
SMT has played a crucial role in CPU design for over two decades, enhancing efficiency in specific workloads. However, as power efficiency and core scaling become more and more important, its long-term relevance is increasingly uncertain.
While SMT still has its place for specific workloads, the industry seems to be moving toward more cores, smarter scheduling, and architectural advancements rather than relying on SMT. Are we witnessing the decline of SMT? It may not disappear entirely, but its golden era could be behind us.
Explore Further
Intel Hyper-Threading technology // Intel’s technical user guide if you want to delve into Hyper-Threading.
Core scheduling // Linux core scheduling feature.
Two threads, one core: how simultaneous multithreading works under the hood // This post is clearly exceptional. If you are interested in low-level stuff, I strongly recommend subscribing to the author’s newsletter, Confessions of a code addict.
Have you experienced performance differences when enabling or disabling SMT on your system?
[1] To give perspective, one of Intel’s latest desktop CPUs (the Core Ultra 9 285K) features a base frequency of 3.7 GHz.
[2] Mainly to simplify things.
[3] We will discuss false sharing in a future post.
[4] The measure of how efficiently a processor converts electrical power into computational performance.