Animation Tech Intro Part 2: Compression

August 15, 2021

Animation Tech Intro Part 2: Compression

You can think of animation as a sequence of poses. In the last post, I explained skinning and how to apply poses to the character. So how can we set this character in motion, and how do we store the animation data itself?

In Blender (or any other 3D software), the animation is a sequence of keys interpolated in a specified way. An example is a curve made of Catmull-Rom or Bezier splines. Rotation, translation, and scale are encoded with these curves. For example, rotation in Euler Angles has 3 curves: pitch, yaw, roll. Each control point of this spline is a key authored by an animator. Also, you might encounter that I will use the word ‘curve’ to describe both 1D curve and 3D curve. This is because we start with 1D curves coming from source animation data, then we will use 3D curves at runtime in the engine. I believe it should be easy to follow, but it’s something worth mentioning upfront.

When importing the animation into the game engine, it is usually sampled with a 30 Hz frequency (or any other frequency that fits the need). So this is our starting point – a sequence of poses in 30 FPS.

Let’s start with something simple, naïve – storing each animation frame as a separate pose (array of transforms). Typically, single transform’s components are translation (vector3), rotation (quaternion), and scale (vector3). Modern characters in AAA games can easily have 1000+ transforms. This sounds crazy. The human body has 206 bones (biologically). Why on earth do we need over 1000? Well, that’s because we’re not really animating the equivalent of biological bones. Limbs, fingers, and spine are indeed close to the real bones in character animation, and we need around 70-100 transforms for these. However, we also need to animate muscles on the face. These can take up to 350-400 transforms. The rest is spent on body muscles and reference locators (slots for items, helper transforms, etc.). However, muscles on the body are usually the result of constraint systems rather than authored animation, so we don’t need to count these. As for the facial animation, that’s a story for another post, so that I will skip them too. This leaves us with 100 transforms to animate the body. Let’s do a back-of-the-envelope calculation to see how much memory is required for 1 second of animation in 30 frames-per-second:

Amount of memory required for 1 second of raw poses for 100 bones

So 1 second of animation in 30 FPS for that character would be around 117 KBs. How do you know whether it’s a lot or not? Well, you might need 30-60 seconds of animation for simple character locomotion, which yields 3-7 MBs. And that’s only one style of locomotion. It would be best if you had a couple of these for other characters. So let’s say you have 5 styles, that’s 15-35 MBs. Also, what about the actions that the player and these characters can do? Let’s say there are another 30 seconds of these for each style. That’s around 30-50 MBs. Then we have various reactions, weapon usage, interactions, and so on. And that’s only for the body. What about the face? You get the point. It’s growing rapidly, and we haven’t even mentioned all the fluff, like characters’ background activities. So how can game engines store and use thousands of animation clips and still fit in the budget? That’s what I will try to explain in this post.

So, where do we start? First, let’s take a closer look at a run animation. With a larger set of data, I would probably get some statistics first, but with this example, we can actually get quite far with few simple observations.

It’s all about rotation

Most bones’ lengths are constant. This means they do not translate during animation. The only bone that is using translation is the ‘Hips’ bone. In other animations, translation might be used for reference locators (item slots, etc.). This means, for the rest of the skeleton, every frame, the translation component of our transforms is the same. There’s also no need for scale to change. Together, translation and scale makes up almost 60% of the data!

The first thought that comes to my mind is keeping one key for each of the constant tracks. Additionally, if the values are equal to bind-pose (most of them should), we can use that pose instead of storing any data in the animation clip. Bind-pose is kept together with the rig (skeleton), so we have access to this data.

We can do the same trick for rotation: tracks that are equal to bind-pose can be removed; tracks that are constant but different than bind-pose can be stored as a single key; the ones that are animated can be kept as they were.

Face and fingers are constant in running animation

We can reduce the data by 60-70% without any quality loss with these simple methods. To reduce memory even further, we need to compress the data we’re left with, which will involve some quality loss.

Key reduction

The simplest method I can think of is linear key reduction. The idea is simple: if you can recreate the key by interpolating its neighbours, you can remove that key. We will need some threshold for this method to work, as it’s a rare situation where the recreated key is perfectly on the line between two other keys. An alternative method is to remove every second or third key, thus reducing the frame rate from 30 Hz to 15 Hz or 10 Hz. This might sound like destroying the original animation, but you won’t notice any difference in many cases. This trick was used in Uncharted 2: Among Thieves (see this GDC presentation, you might need a GDC Vault account to watch this).

With key reduction techniques we’re still removing entire keys, but what about single keys? How can we compress them?

Quantisation

So far, the single key for translation or scale is 3x float number. For rotation, it is a 4x float number. However, we can exploit the nature of the quaternion. Rotation is represented by unit quaternions, so this means we can recreate one of the quaternion’s components based on the other three. The method I use is based on the basic harmonic mean. For this and other methods, I recommend checking the Quaternion Quantisation article by Marc B. Raynolds. At this point, it makes sense to treat a single key of animation as a 3D vector instead of 1D. Also, it’s simplifying the implementation.

// 4(sqrt(2)-1)
const float Km  = 4.0 * 0.4142135679721832275390625;

// sqrt(2)+1 = 1/(sqrt(2)-1)
const float Khf = 2.414213657379150390625;

// 3-2sqrt(2)
const float Khi = 0.17157287895679473876953125;

// compression
void quat_fhm(fm_quat q, fm_vec3* v)
{
	float s = Khf / (1.0 + q.r + sqrt(2.0 + 2.0 * q.r));
	
	v->x = q.i * s;
	v->y = q.j * s;
	v->z = q.k * s;
}

// decompression
fm_quat quat_ihm(const fm_vec3* v)
{
	float d = Khi * fm_vec3_dot(v, v);
	float a = (1.0+d);
	float b = (1.0-d)*Km;
	float c = 1.0/(a*a);
	
	float bc = b * c;
	
	fm_quat q;
	q.i = v->x * bc;
	q.j = v->y * bc;
	q.k = v->z * bc;
	q.r = (1.0 + d * (d - 6.0)) * c;
	
	return q;
}

Now, since it’s a unit quaternion, we know that each component must be within the range of -1 to 1. That most likely means that we don’t need the full range of the float number. At the same time, we probably don’t need the precision of 32 bits, do we? From my experience, 16 bits is enough for most if not all cases, maybe except for cinematic quality animations. So the idea is simple: the range -1..1 is mapped to 0..1, then to 16-bit unsigned integer. Decompression is just the reverse of this. Of course, this means we will introduce some precision error. However, it’s small enough we can live with it.

Compression and decompression, notice the error (0.5 vs 0.499)

An additional benefit from encoding quaternion with three 16-bit values is we can encode the position the same way. However, what’s the endpoints range? Is -1..1 enough? For my needs, I hardcoded the range to be within 20 meters from the centre, so that’s -20..20. However, you can store the range extents with the clip, extracting them from the original data.

Can you get lower than 16 bits? Yes, with help of range reduction, why not.

Range reduction

Some curves might stay within a certain range lower than the full possibilities, so we can limit the range while compressing and decompressing to preserve more quality. This is kind of what I did with the position. I know my animations are no more than -20 to 20 meters away from the centre. So a range of -40 to 40 with the same bitrate would have lower precision. However, with some automated methods, it’s possible to limit the range for rotation as well. So lowering the range is increasing the precision. This leaves us room for experimenting with the bitrate while preserving good enough quality.

There are various ways of approaching that. For example, compressing subsequent blocks with different bitrates or compressing bone tracks with different bitrates, perhaps even compressing single curves differently. Whatever the method, we need a way to find out what’s the lowest bitrate tolerable. For that, it’s worth knowing that error in the animation accumulates along the chain of bones. The deeper the hierarchy, the bigger the error on leaf bones. An excellent GDC talk by Nicholas Frechette is all about range reduction and how to implement it. The idea is simple: a bruteforce algorithm that tries different bitrates and checks the resulting error on virtual vertices skinned to the bones of the skeleton.

Sampling

So far I showed various observations and exploits coming from them regarding the animation data. Now it’s time to put this together and actually sample the animation clip.

typedef struct fa_anim_curve_key_t
{
	uint16_t keyTime;
	uint16_t keyData[3];
} fa_anim_curve_key_t;

typedef struct fa_anim_curve_t
{
	uint16_t index;
	uint16_t numRotKeys;
	uint16_t numPosKeys;
	fa_anim_curve_key_t* rotKeys;
	fa_anim_curve_key_t* posKeys;
} fa_anim_curve_t;
	
typedef struct fa_anim_clip_t
{
	fc_string_hash_t name;
	float duration;
	uint32_t numCurves;
	uint32_t numDataKeys;
	fa_anim_curve_t* curves;
	fa_anim_curve_key_t* dataKeys; // all animation keys
} fa_anim_clip_t;

The animation clip has two allocations: one for all the curves, another for all the keys. Each curve has pointers to the data buffer for its keys. The data is a sequence of position and rotation keys for each curve. If the position or rotation within a curve is constant, then a single key is stored.

Data buffer memory layout, keys are stored in continuous memory

void fa_anim_clip_sample(const fa_anim_clip_t* clip, float time, fa_pose_t* pose)
{
  // iterate each curve (bone) 
  const uint32_t numCurves = clip->numCurves;
  for(uint32_t i_c=0; i_c<numCurves; ++i_c)
  {
    const fa_anim_curve_t* curve = &clip->curves[i_c];
  
    const uint16_t idxXform = curve->index;
    
    // rotation
    {
      const uint16_t numKeys = curve->numRotKeys;
      uint16_t idx = 0;
      
      // find upper index by time (this could be a binary search)
      while(idx < (numKeys-1)
        && fa_decompress_key_time(curve->rotKeys[idx].keyTime) < time)
      {
        ++idx;
      }
      
      const uint16_t upperIdx = idx;
      const uint16_t lowerIdx = idx == 0 ? idx : idx - 1;
      
      fm_quat rot;
      
      if(lowerIdx == upperIdx)  // constant key 
      {
        fa_decompress_rotation_key(&curve->rotKeys[idx], &rot);
      }
      else  // at least two keys - interpolate 
      {
        fm_quat rot1;
        fa_decompress_rotation_key(&curve->rotKeys[lowerIdx], &rot1);
        
        fm_quat rot2;
        fa_decompress_rotation_key(&curve->rotKeys[upperIdx], &rot2);
        
        const float time1 = fa_decompress_key_time(curve->rotKeys[lowerIdx].keyTime);
        const float time2 = fa_decompress_key_time(curve->rotKeys[upperIdx].keyTime);
        
        float alpha = (time - time1) / (time2 - time1);
        fm_quat_lerp(&rot1, &rot2, alpha, &rot);

        // because we do LERP on quaternion (faster on CPU), we have to normalise it 
        fm_quat_norm(&rot);
      }
      
      pose->xforms[idxXform].rot = rot;
    }
    
    // position
    {
      const uint16_t numKeys = curve->numPosKeys;
      uint16_t idx = 0;
      
      // find upper index by time (this could be a binary search)
      while(idx < (numKeys-1)
        && fa_decompress_key_time(curve->posKeys[idx].keyTime) < time)
      {
        ++idx;
      }
      
      const uint16_t upperIdx = idx;
      const uint16_t lowerIdx = idx == 0 ? idx : idx - 1;
      
      fm_vec4 pos;
      
      if(lowerIdx == upperIdx)  // constant key 
      {
        fa_decompress_position_key(&curve->posKeys[idx], &pos);
      }
      else  // at least two keys - interpolate 
      {
        fm_vec4 pos1;
        fa_decompress_position_key(&curve->posKeys[lowerIdx], &pos1);
        
        fm_vec4 pos2;
        fa_decompress_position_key(&curve->posKeys[upperIdx], &pos2);
        
        const float time1 = fa_decompress_key_time(curve->posKeys[lowerIdx].keyTime);
        const float time2 = fa_decompress_key_time(curve->posKeys[upperIdx].keyTime);
        
        float alpha = (time - time1) / (time2 - time1);
        fm_vec4_lerp(&pos2, &pos1, alpha, &pos);
      }
      
      pose->xforms[idxXform].pos = pos;
    }
  }
}

The sampling function is pretty simple. First, iterate all curves in animation. Then, for each curve, find the keys by sampling time and interpolate between them. Do that for both position and rotation. It might be surprising that instead of SLERP (spherical interpolation), I’m doing LERP (linear interpolation) operation on quaternions. This is because it’s faster on the CPU, and with little change between keys, the error is minimal. However, the quaternion needs to be normalised afterwards.

Conclusion

What I explained in this post is simple animation compression. If I were to summarize it with a single sentence, it would be: exploit the domain knowledge. Here are the points from this post:

character’s body animation is all about rotation – scale and translation are rare,
some of the keys are constant, some are equal to bind pose,
4th component of a unit quaternion can be recreated based on the other 3,
we don’t need 32 bits for each float, 16 bits is enough for most of our use cases,
some of the keys can be recreated by interpolating neighbour keys.

You might also want to learn and try totally different compression methods like spline-fitting. Here’s a great article about spline-fitting compression by Riot Games. If you are looking for more sources on animation compression, check the references at the end, I’ve included all the links mentioned so far, plus a few more.

It’s only a starting point for further work. What else can be done? Optimisation on CPU using SIMD instructions – that’s a story for another post. Memory layout optimization to avoid cache misses during animation sampling. Another broad topic is animation streaming. We don’t need to load every animation possible at once. Even the animations used could be partially streamed. However, with what I have explained so far, you should have a brief understanding of how game engines can fit so many animations and stay within the memory budget at the same time.

This is it for today. Next, I will talk about animation blending and building some simple features with it.

References

Adam Dutkiewicz