Sunday, September 08, 2013

Append-only files

I need to implement an append-only persistent data structure for a project using CQRS (I will use it for the event store). The records will be variable in length (a string serialization of the events using either XML, JSON or YAML - I haven't decided yet).

Now, appending to a text file is easy in Windows; just open the file for append. However, I will also need to read the events, starting with a given Id; that's harder to do with a text file. Given that I'm in a "designing algorithms" mood lately (see my first Coursera certificate) and that I haven't been able to find a satisfactory library written by someone else, here's what I came up with.

The data file

Given that the records in the data file will have different lengths, I could use an end-of-record character as a delimiter; however, finding the next record would require a full record scan in that case. (Plus, I am not 100% sure that I want to use text serialization - maybe I'll switch to binary?) That means using the same "trick" that Delphi uses for its strings: prefix them with the string length.

This means that the data file will have the following structure:

(record #) Size (long) Data
0 20 … 20 bytes …
1 74 … 74 bytes …
… and so on…

Each record starts with a long size, followed by size bytes of data. (The record number is not part of the file, it is just shown for completion.)

The index file

The index file is very simple: it is a list of long values, each one pointing to the start of the matching record in the data file (to the size field). The first record is always a zero, since the data file starts there.

Operations

  • Rebuilding the index file is extremely simple; the following pseudo-code assumes the data file is accessible as a byte[] array called data (for example, as a memory-mapped file):
  IEnumerable<long> BuildIndex(byte[] data):
    position = 0
    while (position < data.Length)
      yield return position
      size = read_long(data, position)
      position += size + sizeof(long) // the size doesn't include the field itself
  • Appending a new record:
  void Append(byte[] data, byte[] record):
    position = data.Length
    index.add(position)
    write_long(data, position, record.Length)
    write_bytes(data, position + sizeof(long), record)
  • Reading a specific record:
  byte[] Read(byte[] data, long i):
    position = index[i]
    size = read_long(data, position)
    return read_bytes(data, position + sizeof(long), size)
  • Reading all records starting from a given position:
  IEnumerable<byte[]> ReadFrom(byte[] data, long i):
    position = index[i]
    while (position < data.Length)
      size = read_long(data, position)
      yield return read_bytes(data, position + sizeof(long), size)
      position += size + sizeof(long)

The above code fragments use the following primitives:

  • long read_long(byte[] data, long position) Reads a long from the data array, starting at the given position
  • byte[] read_bytes(byte[] data, long position, long size) Reads a record from the data array, starting at the given position, size bytes long
  • void write_long(byte[] data, long position, long value) Writes a long to the data array, starting at the given position
  • void write_bytes(byte[] data, long position, byte[] record) Writes a byte array, starting at the given position

Of course, even the operations I described are pretty low-level (using a byte array instead of an object); I will have to implement a higher-level interface for real use.

I have uploaded the code to my GitHub repository. It was not written using TDD, I have only added a smoke test to verify that it works correctly. I think the design is pretty clean so far.

Edit: the timings are much better than I expected:

Using ...\tmpC056.tmp - writing and reading 100000 records
append - time elapsed: 3176 msec
read all - time elapsed: 1252 msec
rebuild index - time elapsed: 273 msec
Press any key to continue . . .

That means that, if an app will have to read and replay a million events when it starts, it won't take more than half a minute - which is acceptable.