High Level Design of STLshm
I need to add a quick-start howto. However, for those of you who would like to know how and why something works, and not just how to use it, the following should tell you enough about the implementation to get you going. There are always the example programs in the examples directory.
STLshm makes use of several low-level layers of the ACE library to provide the implementation of shared memory regions, relative pointers, lock types, and a basic malloc() algorithm for free-list management. Each of these elements has a well-defined interface, and the design makes extensive use of generics to ensure that the different concepts remain well encapsulated, and thus replaceable. For example, it should be possible for a user to replace the malloc algorithm without having to change how the shared memory region is implemented, how the shared lock is implemented, and so on.
The structure of the STLshm system is based on the layers of memory management used in OS designs. The layers used in this project are as follows:
1. The lowest level deals with the management of the shared region itself. This includes the ability to create and attach to shared memory regions (and coordination of the race conditions associated with creating it). It also includes the mechanism to increase the size of the region. In a typical heap-based solution, this corresponds to the functionality of the sbrk() call which a process would use to control its allocated heap space.
Like heap allocation on many systems, shared region growth is one-way: the region can only get bigger. The convention I'm going for here is simply to let the region grow to whatever size is needed, rather than requiring everything to be sized in full from the start.
2. The free list (memory management logic) for the memory within the shared memory region. This layer corresponds to the implementation of malloc() and free() logic in a traditional heap-based memory management scheme. The solutions which apply at this layer can exploit the same concurrency capabilities and fragmentation-avoidance strategies which have traditionally been used with heap memory management. In fact, the malloc problem is not really any different in this case, except in regard to its need to work within the shared memory region.
(i.e. the malloc algorithm needs to address the problems of pointer representation and concurrency control, but both of these can be made policy classes, permitting malloc algorithms to be used readily in both heap and shared memory.)
3. The mechanisms that support the use of STL containers in shared memory. This includes a relative pointer class and an appropriate STL allocator. It also includes the primitives for inter-process concurrency control. Fundamentally, the intention with these elements is that using an STL container in shared memory should not be significantly different from using an STL container in heap memory with multiple threads; i.e. the primary issue is concurrency control for the container.
The STLdb project seeks to further extend the material here and adds STL-like containers which support transactions, including write-ahead logging, checkpoints, etc.
Implementing Shared Memory Regions
The shared memory region types supported by the ACE library:
1. SVR4 shared memory
2. Unix memory maps
3. Anonymous regions in the Windows pagefile
There are a number of design issues which have to be addressed in the implementation of the shared memory region.
Design Issue 1 - same virtual address or different virtual addresses
Many POSIX-compliant shared memory mechanisms allow an application to designate the desired virtual memory address where the region should be mapped. I first saw this convention exploited by the ACE library. Although considered non-standard, it is actually fairly widely supported across a number of UNIX OSes. Typically, as long as the application supplies a virtual address which can fit the shared region, and which is properly aligned on a page boundary, the address provided is accepted. As OSes move toward 64-bit compliance, applications will generally have plenty of virtual address space, and thus the ability to reserve portions of their virtual address space for attaching to some shared memory implementation. The advantage of using a common virtual address for the mapping is that pointer values can be put directly into the shared region and be correct for all processes using the region. This in turn helps to minimize the impact on application data types used within shared memory.
There are cases, however, where an application might have difficulty using an agreed upon address. This can be a side effect of the application itself being highly configurable, or needing to change the mappings which it is using over time. So the solution must address the issues associated with allowing applications to map in shared regions in such a manner that the OS chooses the address, and thus applications may be using a common shared region, at different virtual addresses.
The key implication of allowing processes to map in the region at different addresses is the representation of a pointer in the shared region. An absolute value cannot be used, because it will be correct for one process but wrong for the others. Thus, the solution includes a "Relative Pointer" type (the ACE class Based_Pointer_T<T>), which is used as a pointer-to-T in shared memory. This can be set as the pointer type member of STL allocator classes, thereby conveying to STL containers in shared memory that they should use that pointer type.
Design Issue 2 - handling growth of the shared region
With heap memory, if a thread in the process decides to increase the heap size using the sbrk() function to add more memory to the free list, the new memory is immediately recognized and available to all other threads, since they all share a common heap.
However, with processes sharing memory, if one process increases the size of the shared memory region, the other processes using the region do not automatically notice the change and extend their mappings. If unchecked, a process could end up following a pointer into a newly allocated part of the region, thereby experiencing a SEGV.
There are at least two ways to deal with this problem, from a design point of view.
Approach 1: Segmentation
Our shared pointer implementation can employ a memory segmentation approach. i.e. A shared memory pool is divided into ‘segments’, all of which are the same size. Initially, the region is allocated to consist of a starting number of segments. Then as memory demand continues to grow, new segments are allocated and added to the shared memory pool. For a given shared memory pool, there is an array holding segment addresses. The relative pointer holds a segment_id (index into that array), plus an offset. So the actual address is computed “fairly quickly” via segmap[this->segment_id] + this->offset. If segmap[this->segment_id] evaluates to zero, then some other process recently added a segment to the shared region, and the current process needs to map it in. So the code that yields the T* looks something like:
if ( segmap[_segment_id] == 0 ) {
    // some other process added this segment; map it in before proceeding
}
T* ptr = reinterpret_cast<T*>( segmap[_segment_id] + _offset );
return ptr;
While this approach has slightly more overhead than approach 2, it does have some advantages:
• The shared memory doesn't have to occupy a contiguous, growing region of the process's virtual address space. Each new extension can be independently mapped into the process at an address of the OS's choosing. This is in contrast to the second approach (below), in which each new extension must be mapped in at the end of the existing region (i.e. the address of the first byte of the region must not change as the region grows in size).
• It does not rely on SEGV handling to determine when a pointer has been followed into a portion of the shared region that hasn't been mapped yet. This makes the approach slightly more OS-independent.
Approach 2: Detecting growth via SEGV
The ACE library presented a clever way out of this problem. Rather than attempt to determine if pointers ever point beyond the currently mapped region, it lets the CPU do that for you. Rather than seek to prevent a SEGV, it uses the SEGV to detect when a remapping may be necessary. A SEGV signal handler intercepts the fault and determines if it occurred at an address beyond a currently mapped region. It then checks the region's total size and determines if the SEGV could be corrected by remapping the region. If so, it does so and returns. Returning from a SEGV handler causes the CPU to retry the memory access, which in this case will now succeed. This approach minimizes the overhead associated with detecting a resize (an increase in size) of the region, allowing pointer traversal to operate at optimum speed.
There is one requirement which this approach imposes on the way in which a shared region is grown, and it's an important point: the act of re-mapping a region whose size recently increased must not change the base address of the start of the region. If it did, the SEGV would be resolved, but the relative pointer itself would need to have its value recomputed, and that isn't really possible by the time the fault is detected.
For most shared memory implementations, this is a reasonable and achievable constraint. As 64-bit architectures become standard, there will be less concern about running out of address space.
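To make the mechanism concrete, here is a rough sketch of such a handler on a POSIX system. The remap_if_grown() call is hypothetical (a stand-in for the region manager described later); the signal plumbing itself is standard sigaction() usage.

#include <signal.h>

// Hypothetical: if 'fault_addr' lies just beyond a registered region that
// has grown, remap the region in this process and return true.
bool remap_if_grown( void* fault_addr );

extern "C" void segv_handler( int, siginfo_t* info, void* )
{
    if ( remap_if_grown( info->si_addr ) )
        return;  // returning from the handler makes the CPU retry the access

    // Not a region-growth fault: restore default handling and re-raise.
    signal( SIGSEGV, SIG_DFL );
    raise( SIGSEGV );
}

void install_segv_handler()
{
    struct sigaction sa;
    sa.sa_sigaction = segv_handler;
    sigemptyset( &sa.sa_mask );
    sa.sa_flags = SA_SIGINFO;
    sigaction( SIGSEGV, &sa, 0 );
}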
The Shared Memory Region Interface
The different types of shared memory classes are based on a common interface. The intention behind this is to enable some C++ equivalent of dependency injection to be used to allow the application to be configured at run-time to use the optimum implementation.
Because different low-level parameters are required to create the different kinds of regions, each type will have its own configuration data and identifying key type. For example, SVR4 shared memory identifies a region with an integer key, while a memory map identifies the region using a filename. These differences make it somewhat difficult to create a truly polymorphic interface common to all classes. The design of the interface for these classes reflects the desire to support this polymorphism.
class SharedRegionInterface
{
public:
    SharedRegionInterface();
    virtual ~SharedRegionInterface();

    enum attach_options {
        create_or_attach,
        create_only,
        attach_only };

    virtual void attach( const char *region_id,
                         const size_t initial_size,
                         const size_t default_growth_size,
                         std::map<std::string,std::string> *addtl_properties = 0,
                         attach_options opt = create_or_attach ) = 0;

    virtual void *baseAddress() const = 0;
    virtual size_t size() const = 0;
    virtual void resize( const size_t growth_size = 0 ) = 0;
    virtual void detach() = 0;
};
The addtl_properties parameter of the attach method allows parameters that might be required by a specific implementation to be passed in to the attach method. For example, a memory-mapped implementation could take the flags which specify the file protection attributes to be passed to a create() call for the new file. The attach method is required to be appropriately synchronized, so that when several processes invoke it simultaneously with opt=create_or_attach, exactly one takes the role of creator, and the others take the role of attaching to the existing region. For cases where a specific process is supposed to create the region, and others are supposed to attach to it, the other attach_options enum values can be passed to the attach method.
One of the 'fun' problems with implementing the attach() method is the synchronization. The method can't use a shared process mutex for synchronization, since that would simply push the problem further up the stack: i.e. how is the creation of the shared process mutex itself synchronized? Ultimately, the synchronization is primarily around the activity of creating the region, not attaching to it. Creating the shared region includes the task of creating and initializing an appropriate mutex for coordinating further modifications to the shared region. This initial safe construction of the region works as follows (a minimal sketch follows the steps below):
1. The processes attempt to create the region using an OS primitive which is itself concurrency controlled, for example the open() call with the O_CREAT and O_EXCL flags specified. The primitive succeeds for exactly one of the processes, which proceeds to step 2. The processes which don't succeed instead attach to the region. After attaching, they check the validity byte of the region (the first byte in the region) and spin on it until it holds its expected value.
2. The process which created the region then initializes the mutex in the region which will permit standard concurrency control over the region. After initializing the mutex it then acquires it, and then sets the validity byte to the value indicating that the mutex has been created.
3. The other processes will eventually notice that the validity byte has been set. They can then acquire the inter-process mutex and proceed normally.
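As an illustration of these three steps, here is a minimal sketch, assuming POSIX shm_open() as the concurrency-controlled primitive and a pthread process-shared mutex. The RegionHeader layout and kValid value are illustrative; error handling (and the race between the creator's ftruncate() and the attachers' first access) is glossed over.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <pthread.h>
#include <sched.h>

struct RegionHeader {
    volatile unsigned char validity;  // first byte of the region
    pthread_mutex_t        mutex;     // inter-process mutex, kept adjacent
};
static const unsigned char kValid = 0xA5;

RegionHeader* create_or_attach( const char* name, size_t size )
{
    // Step 1: exactly one process wins the O_EXCL creation race.
    int fd = shm_open( name, O_RDWR | O_CREAT | O_EXCL, 0666 );
    bool creator = ( fd >= 0 );
    if ( creator )
        ftruncate( fd, size );
    else
        fd = shm_open( name, O_RDWR, 0666 );  // losers simply attach

    void* base = mmap( 0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    RegionHeader* hdr = static_cast<RegionHeader*>( base );

    if ( creator ) {
        // Step 2: initialize the shared mutex, acquire it, publish validity.
        pthread_mutexattr_t attr;
        pthread_mutexattr_init( &attr );
        pthread_mutexattr_setpshared( &attr, PTHREAD_PROCESS_SHARED );
        pthread_mutex_init( &hdr->mutex, &attr );
        pthread_mutex_lock( &hdr->mutex );
        hdr->validity = kValid;
        // ... initialize the rest of the region under the lock ...
        pthread_mutex_unlock( &hdr->mutex );
    } else {
        // Step 3: spin until the creator publishes the validity byte, then
        // synchronize through the mutex before proceeding.
        while ( hdr->validity != kValid )
            sched_yield();
        pthread_mutex_lock( &hdr->mutex );
        pthread_mutex_unlock( &hdr->mutex );
    }
    return hdr;
}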
Note that based on this algorithm, there is a risk that if the process attempting to create the region is killed during these steps, the other processes will spin forever on the validity byte. The quality of the code produced must ensure that this can only happen in the case of an untimely 'kill -9' on the creating process.
Note also that on SMP systems, a spin loop on the validity byte would not necessarily see the modified value. One of the tasks of a mutex on SMP systems is to enforce a memory barrier, ensuring the synchronization of the CPU caches. One of the reasons for keeping the inter-process mutex and the validity byte in close proximity is so that they will be on the same cache line, and thus the use of the mutex in step 2 above also helps to ensure that other processes will see the validity byte change.
Design of the Relative Pointer
The previous sections established the need for a "Relative Pointer": a pointer type which works when it is in shared memory, and can point to another object in the same shared memory region, regardless of the region's mapped addresses, by using a ptrdiff_t value (a relative offset) to indicate the target address.
Thus the relative pointer uses an offset from its own address (this) as a way to indicate the destination that it points to. This allows a short and fast bit of pointer arithmetic to be used to determine the actual T* value. It also means that the offset has to be recomputed anytime that the pointer is copied.
I had several goals in the design of this pointer type:
1. Speed, i.e. absolutely minimal overhead. I figure that computing the actual address via a single expression ( this + offset ) is about as fast as such an implementation can get. Combined with the fact that it doesn't have to do any bounds checking (because of the SEGV signal handler), the use of this pointer should not represent any significant overhead.
2. Size. Ideally, the RelativePtr<T> should occupy the same amount of space as a regular pointer. Typically, ptrdiff_t (the type for pointer difference) is the same size as a pointer value (for mathematical reasons), and thus this requirement can be met by restricting its storage to a single value of that type.
3. Works anywhere. In the course of working with RelativePtr<T> values, they may get copied into stack variables, temporarily. For example, a std::string that uses the STLshm allocator and thus RelativePtr<T> might get copied into local memory as a temporary copy while the application works with the data in an STL container. The RelativePtr<T> must be usable when this happens. The only restriction with regard to the use of RelativePtr<T> types is that they shouldn't be used to point from one shared region to another, unless the application can guarantee that the regions will always be separated by a constant distance.
RelativePtr<T> has a minimal overhead/footprint. It has one data member: a ptrdiff_t. At first, you might assume that Null is represented by a ptrdiff_t value which causes the expression ( this + offset ) to evaluate to 0. However, that doesn't work unless the value of this is the same for all processes. (And if that were true, you wouldn't need the RelativePtr.) One idea that I originally tried was having the RelativePtr<T> point to itself when the value was Null. After all, a RelativePtr<T> must point to a T, not to a T* or a RelativePtr<T>; the type semantics suggested that nobody would do that. Unfortunately, with some circular list implementations the pointer can end up pointing to itself, and this was the source of one fun bug. Using a value like -1 could likewise turn out not to work if you ever had something like:
template <int N> struct buffer {
    char data[N];
    RelativePtr<char> nextLoc;
};
i.e. At some point, nextLoc might point to data[N-1] (the byte immediately preceding it), in which case an offset of -1 would be incorrectly interpreted as Null.
So the only foolproof option is to use a separate bit for it, giving up on the convention of making sizeof(RelativePtr<T>) == sizeof(T*). However, the current implementation uses the convention that an offset value of 1 means NULL. The reasoning is that a pointer to a T pointing one byte past its own address, in violation of the alignment boundary of its own type, seems extremely unlikely. I couldn't think of a condition where it would yield a false NULL value. (If you think of any, please e-mail me.)
The only operations supported by this pointer type are its conversion into T* and assignment from T* and from other RelativePtr<T> values.
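Putting the pieces together, a minimal sketch of such a RelativePtr<T> might look like the following. This is illustrative, not the actual STLshm source; note that copying recomputes the offset, as discussed above, and that an offset of 1 is reserved to mean NULL.

#include <cstddef>

template <class T>
class RelativePtr {
public:
    RelativePtr() : _offset( 1 ) {}                     // offset 1 == NULL
    RelativePtr( T* p )                 { set( p ); }
    RelativePtr( const RelativePtr& r ) { set( r.get() ); }  // recompute

    RelativePtr& operator=( T* p )                 { set( p ); return *this; }
    RelativePtr& operator=( const RelativePtr& r ) { set( r.get() ); return *this; }

    operator T*() const   { return get(); }
    T* operator->() const { return get(); }

private:
    void set( T* p ) {
        // Store the distance from this pointer's own address to the target.
        _offset = p ? reinterpret_cast<const char*>( p )
                      - reinterpret_cast<const char*>( this )
                    : 1;
    }
    T* get() const {
        return _offset == 1 ? 0
            : (T*)( reinterpret_cast<const char*>( this ) + _offset );
    }
    std::ptrdiff_t _offset;  // sizeof(RelativePtr<T>) == sizeof(ptrdiff_t)
};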
The Malloc algorithm
Now that we have designed a class which can represent an expandable shared memory region, and a pointer implementation, the next piece we will need is a malloc algorithm.
Ideally, I would like to reuse existing malloc algorithms, and I can certainly reuse their design, but not their implementation. The key issue with implementation reuse is the need to use RelativePtr<> for the freelist, and an inter-process mutex type for concurrency control.
With the initial project, a simple malloc algorithm is provided which maintains the freelist as a circularly linked list, in which each new search for memory begins where the previous one left off. The algorithm is first-fit (optimized for speed at the expense of possible memory fragmentation). A single mutex is used for access to the freelist, as shown in the sketch below.
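A minimal sketch of that logic, assuming the RelativePtr<T> from the previous section and hypothetical InterProcessMutex/ScopedLock types standing in for the lock policy class:

#include <cstddef>

struct FreeNode {
    std::size_t           size;  // usable bytes following this header
    RelativePtr<FreeNode> next;  // circular, singly-linked freelist
};

// 'rover' remembers where the previous search left off.
void* simple_malloc( RelativePtr<FreeNode>& rover, InterProcessMutex& lock,
                     std::size_t amount )
{
    ScopedLock guard( lock );    // a single mutex guards the whole freelist
    FreeNode* start = rover;
    FreeNode* cur = start;
    do {
        if ( cur->size >= amount ) {
            // First fit: carve the request off the tail of this chunk, so
            // the freelist links don't need to change.  (Exact fits leave
            // an empty header behind; a real version would unlink it.)
            cur->size -= amount;
            rover = cur;         // next search resumes here
            return reinterpret_cast<char*>( cur + 1 ) + cur->size;
        }
        cur = cur->next;
    } while ( cur != start );
    return 0;                    // nothing fits; caller should grow the region
}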
Future optimizations include:
1. Improve concurrency by adding read/write locking, or by coming up with a way to segment the freelist into independently locked sets.
2. Provide a best fit approach.
3. Make it compatible with the Loki library’s small object allocator.
(Note: Boost.interprocess already has multiple malloc implementations, I think, so I need to adopt the use of them.)
The malloc interface:
class some_malloc_implementation
{
private:
    // A bunch of state data, including pointers to the freelist, etc.
    // All pointers in here should be RelativePtr<> types.
public:
    // malloc should return NULL if there is no memory available, thereby
    // triggering higher-level logic to resize the region.
    void* malloc( size_t amount );

    // free should only be called with allocated memory addresses as
    // previously returned by malloc.
    void free( void *addr );

    // For higher-level logic: add the region of memory starting at addr,
    // which is 'size' bytes long, to the freelist.
    void increase_freelist( void *addr, size_t size );
};
The facade that ties it all together
So now we need a final class which ties together the other classes which we have talked about thus far, and serves to orchestrate the process of attaching to a region, finding the object in shared memory which provides the free-list implementation, and so on.
template <class Lock, class ShmImpl, class Malloc>
class SharedRegion
{
public:
    // Attach to the shared region.
    SharedRegion( const char *region_id,
                  const size_t initial_size,
                  const size_t default_growth_size,
                  std::map<std::string,std::string> *addtl_properties = 0,
                  SharedRegionInterface::attach_options opt =
                      SharedRegionInterface::create_or_attach );

    // Detach from the region.
    void detach();

    // Request some memory.
    template <class T>
    T* malloc( size_t amount );

    // Free up that memory.
    void free( void* addr );

    /* The following are available if needed */

    // What is the base address of the shared memory region?
    void *baseAddress() const;

    // How big is the shared memory region, currently?
    size_t size() const;

    // Pre-emptively increase the amount of shared memory in the
    // shared memory region.
    void resize( const size_t growth_size = 0 );

    // Get/set the region growth size (the amount by which the shared
    // region's size will increase when malloc requests more memory).
    size_t getGrowthSegmentSize() const;
    void setGrowthSegmentSize( size_t size );
};
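Usage would look something like the following; the three policy class names are placeholders for whichever lock, region, and malloc implementations the application configures:

// Hypothetical policy classes; any conforming lock/region/malloc types work.
typedef SharedRegion<InterProcessMutex, MmapRegionImpl, FirstFitMalloc> Region;

Region region( "my_region",      // region_id
               1024 * 1024,      // initial_size (1 MB)
               256 * 1024 );     // default_growth_size (256 KB)

int* values = region.malloc<int>( 100 * sizeof(int) );
// ... build shared data structures in the returned memory ...
region.free( values );
region.detach();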
Designing the STL shared memory Allocator
The STL allocator 'concept' requires that your allocator type have a default constructor. The problem is not so much with containers, which can be given a specific allocator object when constructed, but with the data objects managed by the container; i.e. data objects within a container may be constructed without the benefit of having an allocator object passed to them. For example, in the case of a map<int,string>, there will be cases where a string gets created via its default constructor, e.g. map[1] = "Hi";
However, even when the allocator of contained objects is constructed in this way (via default constructor), we will still need that allocator to realize that it should allocate memory from the shared memory region. For our shared memory mechanism to work, an allocator object must be able to identify the shared memory region that it will work with when it has been constructed using a default constructor. Based on the C++ language, I think this leaves us only the following two options:
1. try to determine the region from the value of 'this' - which is possible if the allocator is created in such a region.
2. rely on static data (or thread-specific data) to inform the constructor.
I wanted the ACE_STL_Allocator to support the simultaneous use of multiple memory pool regions in a program. As such, just giving the class a single, unchanging static value (pointing to the shared memory region's malloc object) didn't seem like the best option. It might be reasonable, however, to have a version of the allocator which uses thread-specific storage, so that individual threads can designate the shared region that any newly constructed allocators should use.
I tried (1) first, because (2) implies complications for processes with multiple regions mapped in. It worked fairly well, but based on its initial design, there were several occasions when it still ended up allocating from the heap. This occurred when temporary stack-based variables of the types used with the map<> template got instantiated.
The interesting question is: can (1) be guaranteed to work? Eventually, all data objects which exist in the shared memory region get constructed there, and might get assigned to. Thus, as long as the ACE_STL_Allocator() can determine its shared memory region upon construction by using the base address registry, it should be possible to make this guarantee, if (a sketch follows this list):
1. The default constructor of ACE_STL_Allocator determines the region it is within.
2. The copy constructor does not inherit the region from its right-hand argument. Basically, the copy constructor has to act like the default constructor.
3. The assignment operator likewise ignores its right-hand argument, just to be safe.
4. The type using the allocator doesn't try to do something too clever, like a 'copy on write' convention in which objects in shared memory would end up having pointers to objects in heap.
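A sketch of what that constructor discipline looks like in an allocator. The RegionRegistry and RegionHandle names are hypothetical stand-ins for the base address registry (described in the region manager section below) and for whatever object exposes the region's malloc/free:

#include <cstddef>

struct RegionHandle {            // hypothetical: the region's malloc/free
    void* malloc( std::size_t amount );
    void  free( void* addr );
};

struct RegionRegistry {          // hypothetical base-address registry:
    // returns the region containing 'addr', or 0 if there is none.
    static RegionHandle* region_containing( const void* addr );
};

template <class T>
class ShmAllocator {
public:
    typedef T value_type;
    // (rebind, construct, destroy, and the other required members omitted)

    // Rule 1: the default constructor determines the region 'this' is in.
    ShmAllocator()
        : _region( RegionRegistry::region_containing( this ) ) {}

    // Rules 2 and 3: copying and assignment deliberately ignore the
    // right-hand argument and re-derive the region from 'this'.
    ShmAllocator( const ShmAllocator& )
        : _region( RegionRegistry::region_containing( this ) ) {}
    ShmAllocator& operator=( const ShmAllocator& ) { return *this; }

    T* allocate( std::size_t n )
        { return static_cast<T*>( _region->malloc( n * sizeof(T) ) ); }
    void deallocate( T* p, std::size_t )
        { _region->free( p ); }

private:
    RegionHandle* _region;  // 0 when 'this' is outside every registered
                            // region; a real version would fall back to heap
};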
Shared Region manager
The shared memory region manager class is a complement to the SharedRegion<> class: a necessary set of static data. It is a singleton. It provides a place for all SharedRegion<> instances existing in the application to register themselves and their current base addresses. Based on that, if a SEGV is triggered at what appears to be an address slightly beyond the end of a shared region, this class catches it and invokes that SharedRegion's check_resize() method. If the region has had its size increased, then the manager's SEGV handler returns, allowing the address to be re-evaluated and rendered addressable.
The application doesn't need to interact with the manager directly, but the manager does offer one other convenience: it can return the SharedRegion<> instance for a given name, helping to avoid cases where multiple SharedRegion<> objects get attached to the same region. It can also perform the detach() operation when its singleton is destroyed, ensuring proper region closure (which might be important for some region types).
The ACE implementation of the above concepts:
The project will build on top of ACE constructs for the implementation of Malloc over Shared Memory.
Memory Pool Implementations (all 5 included by Memory_Pool.h)
1. Local_Memory_Pool(.h) - heap-based pool
2. Shared_Memory_Pool(.h) - shm-based pool
3. MMAP_Memory_Pool(.h) - mmap-based pool
4. Sbrk_Memory_Pool(.h) - sbrk()-based pool
5. Pagefile_Memory_Pool(.h) - Pagefile-based pool
The above are 5 different shared region implementations. Their interfaces are not precisely identical; for example, they each have their own 'options' types. However, their interfaces are similar enough to permit the use of any of these 5 types as a template parameter to the Malloc_T façade (see below).
Relative Pointer Implementations
1. ACE_Based_Pointer_Basic<T> - pointer to type T (Based_Pointer_T.h)
2. ACE_Based_Pointer<T> : public ACE_Based_Pointer_Basic<T> - pointer to T (Based_Pointer_T.h)
These are the implementation of a 'Relative Pointer' in the ACE library. The use of relative pointers relies on the previously described SEGV handling behavior, which is accomplished in ACE via a class that knows about all mapped-in shared memory regions which might require that behavior.
Malloc Control Block Implementations
1. ACE_Control_Block (Malloc.h) - The normal control block, for scenarios which use normal pointers in the memory pool.
2. ACE_PI_Control_Block (PI_Malloc.h) - The Process Independent control block, based on the ACE_Based_Pointer<T> relative pointer class.
The implementation of malloc depends on a control block, and associated 'nodes' in a free list. There are two implementations of the control block offered with ACE: one which uses absolute pointer values, and one which uses position independent (PI) pointer values, via the ACE_Based_Pointer<T> template. My convention has been to always use the position independent form.
The Malloc Algorithm and associated Façade
1. ACE_Malloc_T<MemPool,Lock,ControlBlock> (Malloc_T.h)
2. ACE_Malloc<Pool,Lock> : public ACE_Malloc_T<Pool,Lock,ACE_Control_Block> (Malloc_T.h)
ACE has coupled the façade and the malloc algorithm together into a single class (Malloc_T<>). The Malloc_T object constructs the pool, and constructs its own Lock. The Lock object can't be allocated in the pool, as it is used to help control the initialization of the control block, and is thus needed before any malloc() algorithm is available.
The sequence of construction enforced by Malloc_T is:
1. First the Pool is created.
2. Then its Lock member is created.
3. Then the lock is acquired.
4. The Control Block address is created or re-acquired via a call to pool::init_acquire().
5. Pool::init_acquire() is what actually opens/maps in the region (in the case of an mmap implementation).
6. Malloc_T then initializes the control block if this is a first-time construction; otherwise, it reuses the control block which already exists.
The destruction sequence enforced by Malloc_T ensures the shared region is detached without destroying its contents; i.e. no destructor is ever called on the control block within the region. The only thing that ~Malloc_T does is destroy its associated lock (the ACE_Mutex object), which has the effect of reducing the lock's reference count until no process is using it (i.e. every process has called the destructor on its Malloc_T object). At that point, the actual lock in the ACE_Mutex-managed mmap is destroyed along with the backing file.
Allocator interface classes:
1. ACE_Allocator (abstract class) (Malloc_Base.h)
2. ACE_Allocator_Adapter<Malloc> : public ACE_Allocator (Malloc_T.h)
ACE_Allocator is a fully polymorphic interface for allowing a program to be written using an Allocator, whose run-time instance might be shared memory based, heap based, etc. ACE_Allocator_Adapter<Malloc> works with a Malloc_T<> type to adapt it to the ACE_Allocator interface. This combination is used with my own SharedRegionManager class to make the option of changing the type of shared memory region at run-time a reality.
ACE Lock Types
1. ACE_Mutex - thread or process specific (Mutex.h) (either concrete or a proxy, depending.)
2. ACE_Thread_Mutex - thread mutex (Thread_Mutex.h)
3. ACE_Process_Mutex - process specific mutex (Process_Mutex.h)
4. ACE_RW_Mutex - Read/Write pthread mutex (RW_Mutex.h)
5. ACE_RW_Thread_Mutex : public ACE_RW_Mutex (RW_Thread_Mutex.h)
6. ACE_RW_Process_Mutex - process based read/write pthread mutex (RW_Process_Mutex.h)
7. ACE_Condition_Thread_Mutex - (Condition_Thread_Mutex.h)
8. ACE_Process_Condition - unimplemented (Condition_Thread_Mutex.h)
9. ACE_Condition<ACE_Recursive_Thread_Mutex> - (Condition_Recursive_Thread_Mutex.h)
Multi-process concurrency control is essential. Thus, we make use of the relevant ACE lock types.
ACE_Mutex: when constructed with a USYNC_THREAD parameter, this object is a concrete mutex object, i.e. the mutex variable is a member of the ACE_Mutex. However, when used with USYNC_PROCESS, it is a handle (pointer) to a mutex which is itself contained within a small shared (mmap) region of memory under the control of the ACE_Mutex instance. I extended this class so that an instance which allocates the mutex in a shared memory region is possible. My variation of ACE_Mutex is usable for the complementary locks (or member locks) of objects which are allocated within the shared memory region, such as our map<> and other data structures.
We will also want to take advantage of ACE_RW_Process_Mutex, and of ACE_Process_Condition (which I think we will have to implement ourselves).
ACE Bugs / Custom Extensions:
• Rewrote portions of ACE_Mutex to allow the process mutex data to be allocated within an ACE_Allocator controlled region, (with name-based binding), rather than via an independent memory mapped file.
• ACE_Malloc_T<> has a bug in which the ref_count is never decremented, even when the object is destroyed. Not really a big problem, since the ACE_Malloc_T type never uses that to determine whether or not to destroy something.
• ACE_Malloc_T<> could really use a method like 'findOrCreate<T>( name )' which atomically finds an object within the named list and, if not found, allocates the memory for it and constructs it using the default constructor, all within one lock context, eliminating the need for applications to repeatedly code this themselves. I have temporarily created a type called 'AllocationHelper' which offers this method for an ACE_Allocator controlled region. A sketch of the idea follows.
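The find()/bind()/malloc() calls below are the real ACE_Allocator interface, while the helper itself is illustrative; my AllocationHelper differs in details, notably in holding the allocator's lock across the find-and-bind, which this free function cannot do:

#include "ace/Malloc_Base.h"
#include <new>

template <class T>
T* findOrCreate( ACE_Allocator& alloc, const char* name )
{
    void* found = 0;
    if ( alloc.find( name, found ) == 0 )
        return static_cast<T*>( found );   // the object already exists

    void* mem = alloc.malloc( sizeof(T) );
    if ( mem == 0 )
        return 0;                          // region exhausted
    T* obj = new ( mem ) T();              // default-construct in place
    alloc.bind( name, obj );
    return obj;
}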