Saturday, September 13, 2014

Modern C++ at work to build extensible serialization system

In this post I will present general C++ techniques for static polymorphism and static dispatch that can be applied in a wide range of applications. I will also present possible, maybe oversimplified, implementations of a few classes from the C++ standard library. Don't implement them yourself! Use the ones from the standard library! I implemented them only as an introduction to the more advanced topics presented at the end.

As a teaser you can check the full sample code on GitHub, but I strongly recommend reading the full post before doing so, because the code might be difficult to understand otherwise.

Serialization is used just as a practical example of why we need to know these techniques.
I won't talk about which type of serialization is better, text or binary, nor about the portability or extensibility of serialized data, because that is not the purpose of this post. If you need a well tested serialization system you can take a look at Boost Serialization or Google Protocol Buffers. For the sake of simplicity I will use binary serialization and the standard streams (std::istream and std::ostream).

So... what is serialization? According to Wikipedia, serialization is:
... the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.
I will start with a simple example - how to serialise one int:

void Serialize(std::ostream& stream, int value)
{
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}

void Serialize(std::istream& stream, int& value)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(value));
}
And some use/test case:

{
    std::ofstream stream("test", std::ios::binary);
    Serialize(stream, 123);
}

{
    int someValue = 0;
    std::ifstream stream("test", std::ios::binary);

    Serialize(stream, someValue);
    assert(someValue == 123 && "Oops! My code sample is not working...");
}

What about serializing a float? Yes, this is simple - just replace int with float and everything works! But... wait?!? How many fundamental types do we have? According to the C++ standard, section 3.9.1 Fundamental types, we have 5 standard signed integer types, 5 standard unsigned integer types, 3 floating point types, one boolean type, void and nullptr_t - the last 2 types are not of interest for us. The character types char, wchar_t, char16_t and char32_t are distinct integral types as well. Writing an overload of the Serialize function for each of these types takes a lot of typing!

Let's make the compiler do that for us:

template<typename T>
void Serialize(std::ostream& stream, T value)
{
    stream.write(reinterpret_cast<char*>(&value), sizeof(T));
}

template<typename T>
void Serialize(std::istream& stream, T& value)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(T));
}
This is cool, but it has a problem - the function will be generated for all fundamental types, as we wanted, but also for user defined types. Is there a way to stop the compiler from doing such a bad thing? We can try to convince it to do so...
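To see why this is a problem, here is a minimal sketch. Named and WrittenBytes are hypothetical names introduced only for this illustration; they are not part of the sample code. The unconstrained template happily accepts Named and writes its raw object representation, which holds the internal bookkeeping of std::string rather than anything that can be read back safely.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// The unconstrained template from above.
template<typename T>
void Serialize(std::ostream& stream, T value)
{
    stream.write(reinterpret_cast<char*>(&value), sizeof(T));
}

// Hypothetical user defined type, for illustration only.
struct Named
{
    std::string name;
};

// Hypothetical helper: how many bytes does Serialize write for a value?
template<typename T>
std::size_t WrittenBytes(const T& value)
{
    std::ostringstream stream;
    Serialize(stream, value);
    return stream.str().size();
}

// Both calls compile, but only the first one produces meaningful data.
void UnconstrainedDemo()
{
    assert(WrittenBytes(123) == sizeof(int));
    assert(WrittenBytes(Named{"Bob"}) == sizeof(Named));
}
```

The second assert passes, and that is exactly the issue: the compiler gives no hint that the bytes written for Named are useless.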

First we need to make the compiler answer the following question: "is the type we are using fundamental or not?"
The solution is simple - we define a general template "question" that answers false for all types and add a specialisation that answers true for each type of interest:

template<typename T>
struct is_fundamental
{
    static constexpr bool value = false;
};

template<>
struct is_fundamental<int>
{
    static constexpr bool value = true;
};
This works because the compiler always picks the most specialised version of a class template whose pattern matches the template arguments, and falls back to the general definition only when no specialisation matches. The rules for selecting the most specialised version are pretty complex, but most of the time it is good enough to know that the version with fewer template parameters (0 is a valid count!) is the more specialised one - in our case template<> struct is_fundamental<int> has no template parameters.
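A small sketch of that selection rule, using a hypothetical trait called which (the name and the numeric values are made up for this illustration). It has a general version with one template parameter, a partial specialisation that still has one parameter but matches only pointers, and a full specialisation with no parameters:

```cpp
// General case: matches every type.
template<typename T>
struct which
{
    static constexpr int value = 0;
};

// Partial specialisation: matches any pointer type.
template<typename T>
struct which<T*>
{
    static constexpr int value = 1;
};

// Full specialisation: matches exactly int*.
template<>
struct which<int*>
{
    static constexpr int value = 2;
};

static_assert(which<float>::value == 0, "only the general case matches");
static_assert(which<float*>::value == 1, "the pointer version is more specialised");
static_assert(which<int*>::value == 2, "the full specialisation wins");
```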

As you can see, typing static constexpr bool value = true/false is repetitive, and we don't like to type the same thing over and over. One simple solution is to make a base class that has a constant member value which is true or false and derive from it:

struct false_type
{
    static constexpr bool value = false;
};

struct true_type
{
    static constexpr bool value = true;
};
Now our is_fundamental definition becomes:

template<typename T>
struct is_fundamental : false_type{};

template<>
struct is_fundamental<int> : true_type{};
Now the definition of is_fundamental is shorter, but we still wrote static constexpr bool value = true/false; twice. Can we do better? Sure we can! In C++, templates don't accept only types - they also accept integral values. Let's make a struct that accepts one integer type and one integer value as parameters and declares a member constant of the provided type using the provided value:

template<typename T, T val>
struct integral_constant
{
    static constexpr T value = val;
};
Now true_type and false_type become:
typedef integral_constant<bool, true> true_type;
typedef integral_constant<bool, false> false_type;
This is very cool - we wrote static constexpr T value only once, but what code is the best code? The code one doesn't have to write! Yes, we don't have to write this code because the standard library implementers already did this for us! In the <type_traits> header we have all this functionality and much more! Very well implemented and tested! Just use std::integral_constant and std::is_fundamental!
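As a quick check of the standard versions, here is a short sketch. Point is only a stand-in user defined type for this example:

```cpp
#include <type_traits>

// Stand-in user defined type, for illustration only.
struct Point
{
    float x;
    float y;
};

static_assert(std::is_fundamental<int>::value, "int is fundamental");
static_assert(std::is_fundamental<float>::value, "float is fundamental");
static_assert(std::is_fundamental<wchar_t>::value, "wchar_t is fundamental");
static_assert(!std::is_fundamental<Point>::value, "classes are not fundamental");
static_assert(!std::is_fundamental<int*>::value, "pointers are compound types");

// std::true_type is exactly the alias we built by hand above.
static_assert(std::is_same<std::true_type,
                           std::integral_constant<bool, true>>::value,
              "true_type is integral_constant<bool, true>");
```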

Now that we know whether a type is fundamental or not, we need to find out how to make the compiler generate Serialize only for fundamental types. There are several ways of doing this. I will present here only one, using the return type of the function. It is based on a technique called SFINAE (Substitution Failure Is Not An Error). We need to generate a valid return type for the Serialize function if and only if std::is_fundamental gives us true. To do so we use the same general/special case template technique as we used for is_fundamental.

template<bool Condition, typename T = void>
struct enable_if
{
};

template<typename T>
struct enable_if<true, T>
{
    typedef T type;
};
The member type is defined if and only if the condition is true. If the condition is false, the general case is used, which doesn't define the member type. We can use it like this:

typename enable_if<std::is_fundamental<int>::value, int>::type a = 0;
If for some weird reason std::is_fundamental<int>::value is false, this code is not well formed and the compiler will give an error. If the same type is used as a return type for template function the function will be well formed only when std::is_fundamental<int>::value is true. When std::is_fundamental<int>::value is false the function is not well formed but we won't have a compilation error (remember SFINAE) - the function is just ignored.
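A minimal sketch of that "the function is just ignored" behaviour, written with std::enable_if. Describe is a hypothetical function used only for this illustration. Both overloads have the same parameters; for any given T exactly one of them has a well formed return type, so the call is never ambiguous:

```cpp
#include <string>
#include <type_traits>

// Well formed only when T is fundamental.
template<typename T>
typename std::enable_if<std::is_fundamental<T>::value, std::string>::type
Describe(const T&)
{
    return "fundamental";
}

// Well formed only when T is not fundamental.
template<typename T>
typename std::enable_if<!std::is_fundamental<T>::value, std::string>::type
Describe(const T&)
{
    return "not fundamental";
}
```

Calling Describe(42) selects the first overload and Describe(std::string("x")) selects the second one; the other candidate is silently dropped in each case.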

Typing typename and ::type every time is tedious (remember we don't like to type!). One solution to avoid this is to make an alias for it:

template<bool Condition, typename T>
using enable_if_t = typename enable_if<Condition, T>::type;
Now the usage becomes:

enable_if_t<std::is_fundamental<int>::value, int> b = 0;
As with is_fundamental, the standard library implementers are very kind and have already written std::enable_if for us. Unfortunately the helper enable_if_t is not in the C++11 standard library, but it will be available as std::enable_if_t once C++14 is adopted.

Ok, let's get back to our Serialize function. Now the definitions are:

template<typename T>
enable_if_t<std::is_fundamental<T>::value, void> Serialize(std::ostream& stream, T value)
{
    stream.write(reinterpret_cast<char*>(&value), sizeof(T));
}

template<typename T>
enable_if_t<std::is_fundamental<T>::value, void> Serialize(std::istream& stream, T& value)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(T));
}
I know they don't look nice, but this is way better than writing and maintaining 10-20 overloads for all fundamental types. It also helps the compiler (at least clang) to give good error messages when user defined types are used:

error: no matching function for call to 'Serialize'
note: candidate template ignored: disabled by 'enable_if' [with T = Point]
What about the user defined types, how do we serialize them? One solution is to provide a non member Serialize function for all user defined types. Do we want to do this? The answer depends - for some types we want a non member Serialize, but for others a member function is a better solution.

The question now is how to separate user defined types with a non-member Serialize from user defined types with a member Serialize. The answer for the first part is trivial - we just let the compiler do it for us. For any user defined type T our previous template Serialize function is disabled by the enable_if construct, so a non-member Serialize overload that accepts T is the only candidate left:

struct Point
{
    Point() = default;
    Point(float _x, float _y) : x(_x), y(_y){}

    float x = 0;
    float y = 0;
};

template<typename Stream>
void Serialize(Stream& stream, Point& value)
{
    Serialize(stream, value.x);
    Serialize(stream, value.y);
}
And one usage sample/test:

{
    Point somePoint{123.f, 321.f};
    std::ofstream stream("test", std::ios::binary);
    Serialize(stream, somePoint);
}

{
    Point somePoint;
    std::ifstream stream("test", std::ios::binary);

    Serialize(stream, somePoint);
    assert(somePoint.x == 123.f && "Sample code is not working...");
    assert(somePoint.y == 321.f && "Sample code is not working...");
}
How does this work with only one Serialize function? The C++ static dispatch mechanism really shines here! The Serialize call for members x and y depends on the stream type. When the stream is an input stream, Serialize(std::istream&, float&) is called, and when the stream is an output stream, Serialize(std::ostream&, float) is called. In this way only one function is needed for each user defined type. This is very handy because it is less error prone than defining 2 functions, one for writing and one for reading the values.

Now we need to handle the special case when Serialize is a member function. To separate types that have a member function Serialize from the other types, we have to ask the compiler if an arbitrary type T has a member function called Serialize that accepts std::istream or std::ostream as parameter.

First I will introduce a concept (not related to C++ concepts) presented at CppCon called void_t. void_t is defined in a very simple way using variadic templates, but it has one magic property - it can tell us if some type or expression is well formed:

template<typename...>
using void_t = void;
Yes! This is all! These 2 lines enable us to find out if some arbitrary type T has a member type, has a member function, is complete and much more! One alternative definition for void_t, which works fine in GCC, is the following:

template<typename... T>
struct make_void
{
    typedef void type;
};

template<typename... T>
using void_t = typename make_void<T...>::type;
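Why 2 definitions for void_t? Because the current standard is unclear about whether unused arguments of an alias template still take part in SFINAE, so some compilers ignore them when the first definition is used. Let's find out how one can use void_t: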

When is void_t well formed? When all the types it receives are well formed. What happens when one of the types provided to void_t is not well formed? void_t is not well formed and, in templates, this will make the compiler ignore some class or function where it is used.
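Before applying this to member functions, here is a minimal sketch of the idea on a simpler question: "does T have a member type called value_type?". The trait has_value_type is a hypothetical example, not part of the sample code, and it uses the make_void based definition of void_t from above:

```cpp
#include <type_traits>
#include <vector>

template<typename... T>
struct make_void
{
    typedef void type;
};

template<typename... T>
using void_t = typename make_void<T...>::type;

// General case: the answer is false.
template<typename T, typename = void>
struct has_value_type : std::false_type{};

// Selected only when typename T::value_type is well formed.
template<typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type{};

static_assert(has_value_type<std::vector<int>>::value, "vector has value_type");
static_assert(!has_value_type<int>::value, "int has no member types");
```

For int the type int::value_type is not well formed, so the specialisation is ignored and the general case answers false.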

Let's go back to the problem we are trying to solve here. I will make a statement - an arbitrary type T has a member function F that accepts a parameter P when the expression:

std::declval<T>().F(std::declval<P>())
is well formed. The expression "creates" an object of type T, "creates" an object of type P and tries to call member function F of T using P. Why did I put "creates" between quotes? Because it doesn't really create the objects. std::declval can be used only in unevaluated contexts such as decltype or sizeof, which just look at the properties of the expression. Using std::declval in an evaluated expression makes the program ill-formed.
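A short sketch of std::declval inside decltype. Sensor is a hypothetical type that has no default constructor, which shows that no object needs to be constructed to inspect the call:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical type without a default constructor.
struct Sensor
{
    explicit Sensor(int id) : id_(id) {}

    double Read(int channel) const { return id_ + channel; }

    int id_;
};

// No Sensor is created here: decltype only inspects the expression.
using ReadResult = decltype(std::declval<Sensor>().Read(std::declval<int>()));

static_assert(std::is_same<ReadResult, double>::value, "Read returns double");
```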

For our special case this expression looks like:

std::declval<Serializable&>().Serialize(std::declval<std::istream&>())
std::declval<Serializable&>().Serialize(std::declval<std::ostream&>())
And now to integrate this to get the answer for the question "does type T have member function Serialize that accepts as parameter some stream type S?":

template<typename T, typename S, typename = void>
struct has_serialize : std::false_type{};

template<typename T, typename S>
struct has_serialize<T, S, void_t<decltype(std::declval<T&>().Serialize(std::declval<S&>()))>> : std::true_type{};
Here we can see the beauty of void_t in action - the second definition of has_serialize is selected only when decltype(std::declval<T&>().Serialize(std::declval<S&>())) is well formed.
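A few compile time checks make the behaviour of has_serialize concrete. WithMember, WriteOnly and WithoutMember are hypothetical types made up for this sketch:

```cpp
#include <istream>
#include <ostream>
#include <type_traits>
#include <utility>

template<typename... T>
struct make_void
{
    typedef void type;
};

template<typename... T>
using void_t = typename make_void<T...>::type;

template<typename T, typename S, typename = void>
struct has_serialize : std::false_type{};

template<typename T, typename S>
struct has_serialize<T, S, void_t<decltype(std::declval<T&>().Serialize(std::declval<S&>()))>> : std::true_type{};

// Member template: accepts any stream type.
struct WithMember
{
    template<typename StreamType>
    void Serialize(StreamType&) {}
};

// Accepts output streams only.
struct WriteOnly
{
    void Serialize(std::ostream&) {}
};

struct WithoutMember {};

static_assert(has_serialize<WithMember, std::ostream>::value, "");
static_assert(has_serialize<WithMember, std::istream>::value, "");
static_assert(has_serialize<WriteOnly, std::ostream>::value, "");
static_assert(!has_serialize<WriteOnly, std::istream>::value, "");
static_assert(!has_serialize<WithoutMember, std::ostream>::value, "");
static_assert(!has_serialize<int, std::ostream>::value, "");
```

Note that the answer depends on the stream type as well: WriteOnly is serializable to std::ostream but not from std::istream.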

Now that we know whether the type has a member function Serialize or not, we need to write a non-member function that accepts some arbitrary type T and calls the member function T::Serialize, and to enable this new function only when the member function exists for T:

template<typename S, typename T>
enable_if_t<has_serialize<T, S>::value, void> Serialize(S& stream, T& value)
{
    value.Serialize(stream);
}
And... this is it! It should be everything we need to implement and extend the serialisation system.

To show how easy this system is to use and to extend, I have a small usage example/test case with 2 classes - Player and Gun. It also uses the Point defined before.
Gun class:

struct Gun
{
    Gun() = default;
    Gun(int _damage) : damage(_damage){}

    template<typename StreamType>
    void Serialize(StreamType& stream)
    {
        ::Serialize(stream, damage);
    }
   
    int damage = 10;
};
And now the Player class:

struct Player
{
    Player() = default;
    Player(int _hp, const Point& _position) : hp(_hp), position(_position){}

    template<typename StreamType>
    void Serialize(StreamType& stream)
    {
        ::Serialize(stream, hp);
        ::Serialize(stream, position);
        ::Serialize(stream, gun);
    }
   
    int hp = 100;
    Point position;
    Gun gun;
};
And the use case:

{
    Player somePlayer(9001, {123.f, 321.f});
    std::ofstream stream("test", std::ios::binary);
    Serialize(stream, somePlayer);
}

{
    Player somePlayer;
    std::ifstream stream("test", std::ios::binary);

    Serialize(stream, somePlayer);
    assert(somePlayer.hp == 9001 && "Sample code is not working...");
    assert(somePlayer.position.x == 123.f && "Sample code is not working...");
    assert(somePlayer.position.y == 321.f && "Sample code is not working...");
}
This system is very easy to customise - for example, to move from std::streams to another data format one has to change, or provide overloads for, only 2 functions - the ones defined for fundamental types. Because all other Serialize functions are templates that call the functions defined for fundamental types, they just adapt to the new data format without any need to change existing code. If the functions for fundamental types are changed to use text files, XML files or any other format or serialisation technique, nothing else has to be changed! As a simple test I changed these 2 functions to use text based streams:

template<typename T>
enable_if_t<std::is_fundamental<T>::value, void> Serialize(std::ostream& stream, T value)
{
    stream<<value<<std::endl;
}

template<typename T>
enable_if_t<std::is_fundamental<T>::value, void> Serialize(std::istream& stream, T& value)
{
    stream>>value;
}
and everything else works as expected. Without any other changes!
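Here is a self-contained sketch of that test, done in memory with string streams instead of files. It uses std::enable_if directly and a simplified Point without constructors; ToText, FromText and TextRoundTripDemo are hypothetical helpers added only for this example. Separate std::ostringstream and std::istringstream objects are used because a std::stringstream is both an input and an output stream, which would make the call ambiguous.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>

// Text based overloads for fundamental types.
template<typename T>
typename std::enable_if<std::is_fundamental<T>::value, void>::type
Serialize(std::ostream& stream, T value)
{
    stream << value << std::endl;
}

template<typename T>
typename std::enable_if<std::is_fundamental<T>::value, void>::type
Serialize(std::istream& stream, T& value)
{
    stream >> value;
}

// Simplified Point; its Serialize is the same as in the binary version.
struct Point
{
    float x;
    float y;
};

template<typename Stream>
void Serialize(Stream& stream, Point& value)
{
    Serialize(stream, value.x);
    Serialize(stream, value.y);
}

// Hypothetical helpers for the in-memory test.
std::string ToText(Point point)
{
    std::ostringstream out;
    Serialize(out, point);
    return out.str();
}

Point FromText(const std::string& text)
{
    std::istringstream in(text);
    Point point{};
    Serialize(in, point);
    return point;
}

void TextRoundTripDemo()
{
    const std::string text = ToText(Point{123.f, 321.f});
    assert(text == "123\n321\n");

    const Point copy = FromText(text);
    assert(copy.x == 123.f && copy.y == 321.f);
}
```

One caveat of this simple text format: operator>> skips whitespace and the default float formatting keeps only 6 significant digits, so whitespace characters and some floating point values will not survive the round trip unchanged.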

Regarding performance - there is no overhead from dynamic dispatch: all the functions are direct calls (no virtual calls) and most of them will probably be inlined by the compiler. Also, for some types, like Point, the Serialize function could write/read both values in a single call, but this approach reduces the flexibility when facing changes of data format or stream type.
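A sketch of what such a single call version could look like for a simplified Point, assuming the type is trivially copyable (checked with a static_assert). BinaryRoundTrip is a hypothetical helper for the test. Note how this version is tied to binary std::streams, which is exactly the loss of flexibility mentioned above:

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>

struct Point
{
    float x;
    float y;
};

// Copying raw bytes is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<Point>::value,
              "Point must be trivially copyable");

// Writes both members with a single call.
void Serialize(std::ostream& stream, const Point& value)
{
    stream.write(reinterpret_cast<const char*>(&value), sizeof(Point));
}

// Reads both members with a single call.
void Serialize(std::istream& stream, Point& value)
{
    stream.read(reinterpret_cast<char*>(&value), sizeof(Point));
}

// Hypothetical helper: write to memory, then read back.
Point BinaryRoundTrip(const Point& original)
{
    std::ostringstream out;
    Serialize(out, original);

    std::istringstream in(out.str());
    Point copy{};
    Serialize(in, copy);
    return copy;
}
```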

One point of attention - for incomplete types has_serialize gives an answer that is technically correct but probably unwanted (beware of forward declarations)! An incomplete type doesn't have a member Serialize, and to be honest it doesn't have any members at all. To catch this, a static_assert that fires when the type T is incomplete can be used in the non-specialised version (the one that returns false) of has_serialize. To check if a type is complete or not, void_t can be used in combination with the sizeof operator, but I will leave this as an exercise.
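One possible sketch of that exercise follows. The name is_complete and the two test types are assumptions made for this example. sizeof(T) is not well formed for an incomplete type, so the specialisation is ignored for it:

```cpp
#include <type_traits>

template<typename... T>
struct make_void
{
    typedef void type;
};

template<typename... T>
using void_t = typename make_void<T...>::type;

// General case: the type is assumed to be incomplete.
template<typename T, typename = void>
struct is_complete : std::false_type{};

// Selected only when sizeof(T) is well formed, i.e. T is complete.
template<typename T>
struct is_complete<T, void_t<decltype(sizeof(T))>> : std::true_type{};

struct Complete {};
struct Incomplete;  // forward declaration only

static_assert(is_complete<Complete>::value, "");
static_assert(is_complete<int>::value, "");
static_assert(!is_complete<Incomplete>::value, "");
```

Be careful with this trait: the answer for a given type is fixed the first time is_complete is instantiated for it, so asking again after the type has been completed will still give false.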

If you have any constructive comments or suggestions I'm eager to hear them!
If you find any mistakes, please point them out so I can correct them. Otherwise I or others may not see them and they will live here forever.