Revisiting Threads – Overhead of explicit threads

Recently I had the good fortune to read some invaluable books: CLR via C# by Jeffrey Richter, C# in Depth by Jon Skeet, and Writing High-Performance .NET Code by Ben Watson. They allowed me to revisit some of the basics of threads, and I thought I would write down my notes from the books. In this first part on Asynchronous Programming, we will begin by examining (or revisiting) the internals of a thread, and thereby understand why creating explicit threads is such a bad idea.

The overhead of a thread can be classified into two broad categories.
* Space, in terms of memory consumption
* Time, in terms of execution performance

Keeping the overheads in mind, let us look at what happens when a new thread is created.

Memory Allocation

For each new thread that is created, the operating system allocates each of the following data structures:

Thread Kernel Object

The Thread Kernel Object is a data structure (a block of memory) allocated by the OS, which can be accessed only by the kernel. Its key purpose is to store information about the particular thread, including the thread context. The thread context holds the state of the CPU registers as of the last time the thread executed.

In addition, the Thread Kernel Object stores statistical information about the thread, such as its creation time, state, priority, number of context switches, kernel-mode time and user-mode time, among others.
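Some of these statistics can be observed from managed code. The following is only a rough sketch, assuming the `System.Diagnostics.ProcessThread` class exposes the values the OS tracks for each thread; the class name `ThreadStatsDemo` and the method `PrintThreadStats` are made up for illustration, and some properties may not be available on every platform.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class ThreadStatsDemo
{
    // Prints a few per-thread statistics for the current process
    // and returns the number of OS threads that were enumerated.
    public static int PrintThreadStats()
    {
        using Process process = Process.GetCurrentProcess();
        int count = 0;

        foreach (ProcessThread thread in process.Threads)
        {
            count++;
            try
            {
                Console.WriteLine(
                    $"Id={thread.Id} " +
                    $"Started={thread.StartTime:T} " +
                    $"State={thread.ThreadState} " +
                    $"KernelTime={thread.PrivilegedProcessorTime.TotalMilliseconds:F1} ms " +
                    $"UserTime={thread.UserProcessorTime.TotalMilliseconds:F1} ms");
            }
            catch (Exception ex) when (ex is InvalidOperationException ||
                                       ex is Win32Exception ||
                                       ex is NotSupportedException)
            {
                // The thread may have exited in the meantime, or the
                // platform may not expose one of these values.
                Console.WriteLine($"Id={thread.Id} (statistics unavailable)");
            }
        }

        return count;
    }

    public static void Main()
    {
        Console.WriteLine($"Threads in process: {PrintThreadStats()}");
    }
}
```

Even a trivial console application usually reports more than one thread here, because the runtime creates a few threads of its own.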

Furthermore, the Thread Kernel Object contains the stack pointer, which points to the stack frame of the function currently executing in the thread, and the instruction pointer, which points to the instruction the CPU was executing. It also holds the addresses of the TEB and of the stacks (user mode and kernel mode).

Thread Environment Block (TEB)

The TEB, or Thread Environment Block, is a block of memory allocated in user mode (and hence accessible to the application) for each thread. It typically consumes 1 page of memory (4 KB on most common processors).

One of the key purposes of the TEB is to hold the head of the thread’s exception-handling chain. Each time the thread enters a try block, a node is inserted at the head of this chain; the node is removed each time the code exits the try block.

The TEB also holds the thread’s local storage, as well as data structures used by GDI and OpenGL.
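The idea of thread-local storage, where each thread sees its own copy of a value, can be illustrated from managed code with `ThreadLocal<T>`. This is a minimal sketch of the concept only: how the CLR maps managed thread-local data onto the TEB is an implementation detail not covered here, and the names `ThreadLocalDemo` and `Counter` are made up for the example.

```csharp
using System;
using System.Threading;

public static class ThreadLocalDemo
{
    // Every thread that touches Counter gets its own copy, starting at 0.
    private static readonly ThreadLocal<int> Counter =
        new ThreadLocal<int>(() => 0);

    // Runs two threads that increment "the same" variable a different
    // number of times and returns what each thread saw at the end.
    public static int[] Run()
    {
        int[] results = new int[2];

        var first = new Thread(() =>
        {
            for (int i = 0; i < 5; i++) Counter.Value++;
            results[0] = Counter.Value;
        });
        var second = new Thread(() =>
        {
            for (int i = 0; i < 3; i++) Counter.Value++;
            results[1] = Counter.Value;
        });

        first.Start();
        second.Start();
        first.Join();
        second.Join();

        return results;
    }

    public static void Main()
    {
        int[] results = Run();
        Console.WriteLine($"{results[0]}, {results[1]}"); // prints "5, 3"
    }
}
```

Neither thread sees the other’s increments, so no locking is needed around `Counter`.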

User Mode Stack

The User Mode Stack stores the return address, which tells the thread what to execute once the current method ends; the address is removed from the stack when the method returns. The stack is also used for storing the local variables and the parameters of the method.

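By default, Windows reserves 1 MB of address space for each thread’s user-mode stack. Physical memory is committed as the stack is actually used, up to that reserved limit, and a different size can be requested when the thread is created.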
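In .NET, the stack size can be requested through the `Thread` constructor overload that takes a `maxStackSize` argument. The sketch below assumes a 256 KB stack is enough for a modest recursion depth; the names `StackSizeDemo`, `Recurse` and `RunWithStack` are hypothetical, and the operating system may round or adjust the requested size.

```csharp
using System;
using System.Threading;

public static class StackSizeDemo
{
    // Simple non-tail recursion: every call adds one frame to the
    // user-mode stack (return address, parameter, temporaries).
    private static int Recurse(int depth) =>
        depth == 0 ? 0 : 1 + Recurse(depth - 1);

    // Runs the recursion on a dedicated thread whose stack size is
    // requested explicitly instead of relying on the default.
    public static int RunWithStack(int maxStackSizeBytes, int depth)
    {
        int result = -1;
        var thread = new Thread(() => result = Recurse(depth), maxStackSizeBytes);
        thread.Start();
        thread.Join();
        return result;
    }

    public static void Main()
    {
        // 256 KB instead of the default 1 MB.
        Console.WriteLine(RunWithStack(256 * 1024, 1000)); // prints 1000
    }
}
```

A smaller stack reduces the address space each thread reserves, at the price of overflowing sooner on deep recursion, so this is a trade-off rather than a free saving.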

Kernel Mode Stack

When a method calls a kernel-mode function, the arguments of the call are stored in a different data structure called the Kernel Mode Stack. The application cannot access the Kernel Mode Stack directly. This is done for security reasons: when a kernel function is executed, the OS copies the parameters from the User Mode Stack to the Kernel Mode Stack, so the application can no longer modify them once the kernel starts working on them.

The Kernel Mode Stack is typically 12 KB on a 32-bit system and 24 KB on a 64-bit system.

Unmanaged DLLs

One of the policies of the Windows operating system requires that, for every new thread that is created, all unmanaged DLLs loaded in the process have their DllMain function called with the DLL_THREAD_ATTACH flag. Similarly, DllMain is called with DLL_THREAD_DETACH when the thread dies. Some DLLs need this for per-thread initialization and cleanup.

This, understandably, has a performance implication every time a thread is created.
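The combined cost of creating and destroying threads (the allocations described above plus the DLL notifications) can be observed roughly by timing many short-lived dedicated threads against the same number of work items handed to the thread pool. This is a crude sketch rather than a proper benchmark: the class `ThreadCostDemo` and its methods are made up for illustration, `Task.Run` is used as a convenient way to queue work to the thread pool, and the numbers will vary widely between machines and runs.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ThreadCostDemo
{
    // Creates, starts and joins 'count' dedicated threads that do nothing.
    public static TimeSpan TimeDedicatedThreads(int count)
    {
        var stopwatch = Stopwatch.StartNew();
        var threads = new Thread[count];
        for (int i = 0; i < count; i++)
        {
            threads[i] = new Thread(() => { });
            threads[i].Start();
        }
        foreach (Thread thread in threads)
        {
            thread.Join();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Queues 'count' empty work items to the thread pool and waits for them.
    public static TimeSpan TimePoolWorkItems(int count)
    {
        var stopwatch = Stopwatch.StartNew();
        var tasks = new Task[count];
        for (int i = 0; i < count; i++)
        {
            tasks[i] = Task.Run(() => { });
        }
        Task.WaitAll(tasks);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void Main()
    {
        const int count = 1000;
        Console.WriteLine($"Dedicated threads: {TimeDedicatedThreads(count).TotalMilliseconds:F1} ms");
        Console.WriteLine($"Thread pool items: {TimePoolWorkItems(count).TotalMilliseconds:F1} ms");
    }
}
```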

Context Switching

Each logical processor can run only a single thread at a time. Each thread is allowed to run for a specified slice of time (known as the thread quantum), typically around 15-20 ms. When the thread quantum expires, the scheduler picks another thread and allows it to use the processor.

The OS thread scheduler keeps the thread kernel objects in different queues based on the state of the thread (Ready, Waiting and Exiting). When the quantum of a thread finishes, the scheduler checks the Ready queue and picks a new thread, causing a context switch.

Context switching is the process of saving and restoring the state of a thread so that it can be resumed later. This includes restoring the CPU registers from the state stored in the Thread Kernel Object.

Every context switch requires the scheduler to:
* Save the state of the CPU registers for the current thread in that thread’s kernel object.
* Pick another thread.
* Load the state of the CPU registers for the new thread, which was previously stored in the new thread’s kernel object.

Additionally, while a thread is running, its code and data reside in the CPU’s cache. This avoids frequent access to RAM, which is significantly slower than the CPU’s own cache. After a context switch, the new thread’s code and data are most likely not in the cache, so the CPU must access RAM again to repopulate it.

This whole process repeats every 15-20 ms, which is pure performance overhead. The obvious question that comes to mind is: wouldn’t that happen even with the thread pool?

The answer is yes. However, one of the critical decisions the thread pool makes is to maintain an optimal number of threads. We will go into the details of the thread pool later; the point of interest for now is that the thread pool keeps the number of threads optimal and does not let it get out of hand. Also, with fewer threads competing for the processor, each thread has a better chance of being scheduled.
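The bounds within which the thread pool manages its worker threads can be inspected at runtime. The following is a small sketch using `ThreadPool.GetMinThreads` and `ThreadPool.GetMaxThreads`; the class `PoolLimitsDemo` and the method `GetWorkerLimits` are hypothetical names, and the actual values depend on the machine and the runtime version.

```csharp
using System;
using System.Threading;

public static class PoolLimitsDemo
{
    // Returns the minimum and maximum number of worker threads the
    // thread pool is currently configured to keep or create.
    public static (int MinWorkers, int MaxWorkers) GetWorkerLimits()
    {
        ThreadPool.GetMinThreads(out int minWorkers, out _);
        ThreadPool.GetMaxThreads(out int maxWorkers, out _);
        return (minWorkers, maxWorkers);
    }

    public static void Main()
    {
        var (minWorkers, maxWorkers) = GetWorkerLimits();
        Console.WriteLine($"Logical processors: {Environment.ProcessorCount}");
        Console.WriteLine($"Pool worker threads: min={minWorkers}, max={maxWorkers}");
    }
}
```

Comparing the minimum worker count with the number of logical processors gives a feel for how closely the pool ties its thread count to the hardware that is actually available.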

Garbage Collection

When the garbage collector runs, the CLR suspends all the threads and walks each thread’s stack to find the roots used to mark objects in the heap. Once objects have been moved, the GC walks the stacks again to update the roots.

This is another case where a smaller, optimal number of threads improves performance.

Summary

All of the above factors highlight why creating threads explicitly is usually a poor choice. While threads are highly useful for running operations asynchronously in your application, one needs to strike the right balance in the number of threads that are alive at any moment. Considering the memory overhead of allocating a thread, it is far better to reuse threads. This is exactly what the thread pool does.

Having said that, there are cases where creating threads explicitly can be recommended.
* You need a thread that runs at a non-Normal priority. By default, all thread pool threads run at Normal priority.
* You need a foreground thread. The threads in the thread pool are background threads.
* You have an extremely long-running compute-bound task and you want to avoid taxing the thread pool’s logic.
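The cases above might look roughly like this in code. This is only a sketch under a few assumptions: `ExplicitThreadDemo`, `StartDedicatedThread` and `StartLongRunning` are made-up names, thread priorities may have little or no effect outside Windows, and `TaskCreationOptions.LongRunning` is shown as an alternative way to hint that the work should not occupy a regular pool thread, not as something the books prescribe.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ExplicitThreadDemo
{
    // Creates and starts a dedicated thread with an explicit priority
    // and an explicit foreground/background setting.
    public static Thread StartDedicatedThread(
        ThreadStart work, ThreadPriority priority, bool isBackground)
    {
        var thread = new Thread(work)
        {
            Priority = priority,
            IsBackground = isBackground // false = foreground thread
        };
        thread.Start();
        return thread;
    }

    // Alternative for long-running work: a task created with the
    // LongRunning hint on the default scheduler.
    public static Task StartLongRunning(Action work) =>
        Task.Factory.StartNew(
            work,
            CancellationToken.None,
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default);

    public static void Main()
    {
        Thread worker = StartDedicatedThread(
            () => Console.WriteLine($"Running at {Thread.CurrentThread.Priority} priority"),
            ThreadPriority.AboveNormal,
            isBackground: false);
        worker.Join();

        StartLongRunning(() => Console.WriteLine("Long-running work done")).Wait();
    }
}
```

Because the dedicated thread is a foreground thread, the process stays alive until its work completes, even if the main thread returns first.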

In the next part, we will examine the thread pool and how it keeps the thread count optimal.