2013-12-09

How to add very large data blobs to C and C++ binaries

This blog post explains how to add large (>100 MB) data blobs to C and C++ binaries in an efficient way.

Short data

For short data, create a .c file containing this:

const char hi[15] = "Hello, World\077!\n";
const char *hi_end = hi + sizeof(hi) / sizeof(char);

And use it like this from another .c file:

#include <stdio.h>
extern const char hi[], *hi_end;
int main() {
  size_t hi_size = hi_end - hi;
  fwrite(hi, 1, hi_size, stdout);
  return 0;
}

When writing the string literal, you have to escape not only \, " and nonprintable bytes (code less than 32 or greater than 126; octal escapes work for these whether char is signed or unsigned), but also ?, because unescaped ? characters may form unintended trigraphs. Escape nonprintable bytes as exactly 3-digit octal escapes, such as \012. Don't use shorter octal escapes or hex escapes, because they absorb the characters that follow: a short octal escape absorbs following octal digits, and a hex escape absorbs any following characters in [0-9a-fA-F].
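The escaping rules above can be sketched as a small C helper. This is only an illustration: the name escape_c_literal and its buffer-based interface are made up for this example, and it assumes an ASCII execution character set. It writes every byte that needs escaping as a 3-digit octal escape and copies all other bytes unchanged.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cobjgen): writes data[0..size) to `out`
 * as the body of a C string literal, without the surrounding quotes.
 * Bytes below 32 or above 126, and the characters \ " ? are emitted as
 * exactly 3-digit octal escapes; everything else is copied as is.
 * Returns the number of characters written (excluding the terminating
 * '\0'), or (size_t)-1 if `out` is too small. Assumes ASCII. */
size_t escape_c_literal(const unsigned char *data, size_t size,
                        char *out, size_t out_cap) {
  size_t n = 0;
  if (out_cap == 0) return (size_t)-1;
  for (size_t i = 0; i < size; ++i) {
    unsigned char c = data[i];
    /* Reserve room for the longest escape (4 chars) plus the final '\0'. */
    if (out_cap - n < 5) return (size_t)-1;
    if (c < 32 || c > 126 || c == '\\' || c == '"' || c == '?') {
      out[n++] = '\\';
      out[n++] = (char)('0' + (c >> 6));
      out[n++] = (char)('0' + ((c >> 3) & 7));
      out[n++] = (char)('0' + (c & 7));
    } else {
      out[n++] = (char)c;
    }
  }
  out[n] = '\0';
  return n;
}
```

For example, the five bytes H, i, ?, newline and " come out as Hi\077\012\042. Because every escape has exactly 3 digits, it doesn't matter which byte follows it.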

Another option is to specify the bytes as integer literals, but it can get quite long for 8-bit bytes, because char can be either signed or unsigned:

const char hi[] = {126, 127, (char)128, (char)129 };
const char *hi_end = hi + sizeof(hi) / sizeof(char);

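These solutions work for both C and C++, with two C++ caveats: an explicitly sized char array must leave room for the terminating '\0' of the string literal, and a const array at namespace scope has internal linkage unless it is declared extern.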

Long data

Neither of the solutions above works well for long blobs, because some C (or C++) compilers don't accept string literals longer than a few megabytes, and putting string literals next to each other to get implicit concatenation also doesn't work if there are too many. Typical failure modes are: compile error, excessive memory usage, extremely slow (slower than O(n) or with a large constant) compilation.

If we don't care about the actual contents of most of the array, there is an easy and quick solution:

const char hi[1234567] = "SIGNATURE";
const char *hi_end = hi + sizeof(hi) / sizeof(char);

This creates an .o file containing 1234567 bytes of data, starting with the bytes SIGNATURE, followed by 1234558 '\0' bytes. Surprisingly, gcc compiles this very quickly and with little memory, generating a small (shorter than 400 bytes) .s (assembly) file and spending most of its time writing the .o file to disk. It's easy to verify the low memory usage by making the data size 400 MB and limiting the virtual memory size to about 30 MB (ulimit -v 30000).

So now we have an .o file with the correct size, with a signature and lots of '\0's instead of the real data. All we need to do is open the file, read the first few kilobytes, find the signature, and overwrite it and the zeros following it with the real data. We can do this without having to understand the .o file format: it's enough to choose a long enough signature to make it unique within the .o file.
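The patching step can be sketched in C as follows. This is not how cobjgen is implemented (cobjgen is a Perl script); the function name patch_blob and the 64 KB search window are assumptions made for this illustration. The caller is responsible for making sure that the blob is no longer than the array declared in the .c file, because the sketch doesn't parse the .o file format and can't check that.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: finds the first occurrence of the signature `sig`
 * within the first 64 KB of the file `obj_path`, and overwrites the file
 * contents starting there with blob[0..blob_size).
 * Assumes that the signature is unique and that blob_size doesn't exceed
 * the size of the array declared in the C source.
 * Returns 0 on success, -1 on failure (including signature not found). */
int patch_blob(const char *obj_path, const char *sig,
               const void *blob, size_t blob_size) {
  static char head[65536];
  size_t sig_len = strlen(sig);
  FILE *f = fopen(obj_path, "r+b");
  if (f == NULL) return -1;
  size_t got = fread(head, 1, sizeof head, f);
  long pos = -1;
  for (size_t i = 0; sig_len > 0 && i + sig_len <= got; ++i) {
    if (memcmp(head + i, sig, sig_len) == 0) {
      pos = (long)i;
      break;
    }
  }
  /* The fseek is also required to switch the stream from reading to writing. */
  int ok = pos >= 0 &&
           fseek(f, pos, SEEK_SET) == 0 &&
           fwrite(blob, 1, blob_size, f) == blob_size;
  if (fclose(f) != 0) ok = 0;
  return ok ? 0 : -1;
}
```

After loading the real data into memory, a call such as patch_blob("hidatac.o", "SIGNATURE", data, data_size) would replace the placeholder. The bytes of the .o file before the signature and after the end of the blob are left untouched, so the file size and all section offsets stay the same.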

The cobjgen Perl script automates all this. Use it like this:

$ cobjgen input_blob.bin hi gcc -c hidatac.c
$ gcc -o progc main.c hidatac.o

$ cobjgen input_blob.bin hi g++ -c hidatacc.cc
$ g++ -o progcc main.cc hidatacc.o
