Wednesday, June 15, 2011

Versioned Arrays

I just learned of a method of creating functional data structures from mutable types called versioning. Basically, if you have mutable data like an array, you can implement functional changes of the data structure by storing the differences between multiple references. This is a might bit better than copying the data structure, particularly in the case of a large array. For an array of size $N$, versioned arrays can hold two arrays with $m$ changes between them using only $O(N+m)$ space. Further, while the access time for an element when the array is $m$ away from it's version is $O(m)$, many use cases require only accessing arrays a fixed number of changes away, making access time $O(1)$.


A Simple Implementation using Lists

This can be implemented simply enough using cons cells and lists. Implementing in Common Lisp, accessing an element is as follows:

  1. A versioned array is list whose last element is an array
  2. The values before the last element are a list of changes that need to occur
    to mutate the array into the one we are interested in.
  3. Access to the array involves applying all changes to the array and inverting
    the changes as we a moving the array to the other side of them (I'll try to
    clarify later)
  4. Leaving you with a list of length one, containing an array with the desired
    values in it, at which point you can access the data.

By moving the array to the version we are currently accessing from, we are betting that more accesses will happen from this, or a similar, version. This will mean the rest of the accesses will be faster than the first.

(defpackage :versioned-arrays
    (:use :cl :bt :iter :modf)
  (:export #:make-versioned-array
           #:varef ))

(in-package :versioned-arrays)

(defun make-versioned-array (dimensions
                             &rest args
                             &key (element-type t)
                             initial-element initial-contents
                             adjustable fill-pointer
                             displaced-to displaced-index-offset )
  "Make a versioned array."
  (when (or adjustable fill-pointer displaced-to displaced-index-offset)
    (error "The capabilities: adjustable, fill-pointer, displaced-to,
    displaced-index-offset are not implemented yet") )
  (list (apply #'make-array dimensions args)) )

(defun varef (v-arr &rest idx)
  (multiple-value-bind (array changes) (get-array v-arr)
    (invert-changes (cdr v-arr) changes v-arr)
    (setf (car v-arr) array
          (cdr v-arr) nil )
    (apply #'aref (car v-arr) idx) ))

(defun get-array (v-arr)
  (if (arrayp (car v-arr))
      (values (car v-arr) nil)
      (multiple-value-bind (array changes)
          (get-array (cdr v-arr))
        (destructuring-bind (new-val &rest idx) (car v-arr)
          (let ((old-val (apply #'aref array idx)))
            (setf (apply #'aref array idx) new-val)
            (values array
                    (cons (cons old-val idx) changes) ))))))

(defun invert-changes (old-v-arr changes last)
  (cond ((null changes) nil)
        (t (setf (cdr old-v-arr) last
                 (car old-v-arr) (car changes) )
         (invert-changes (cdr old-v-arr) (cdr changes) old-v-arr) )))

First, let's consider a simple linear version history. This would happen if you had an array and only modified the original or the latest versions of the array. Using the standard cons cell diagrams where red is the car and blue is the cdr, our linear versioned array would look something like this:

Accessing an element in a linear versioned array. This is the way the structure would change if the last access was via Ref 2 and the next was via Ref 1. Any cons cell is a valid version of this array before and after the "rebase"

Modification goes very similarly:

  1. Given that we want to modify the value at indices idx to new-val
  2. We first store the current value at idx. This has the side effect of
    moving our array to the current version.
  3. We cons a new cell to store the array in, place the delta where the array
    currently is, and modify the cdr of that cell to point to the new one.

This can be simply coded using my Modf library:

(define-modf-function varef 1 (new-val v-arr &rest idx)
  (let ((old-value
         ;; This moves the array to our version
         (apply #'varef v-arr idx) ))
    (let ((arr (car v-arr)))
      (setf (apply #'aref (car v-arr) idx) new-val)
      (setf (cdr v-arr) (list arr)
            (car v-arr) (cons old-value idx) )))
  (cdr v-arr) )

When looking at what is above, it is important to note that this is a mutation filled computation. This does not involve building lists the normal way with cons. Instead we are modifying lists in place. This is necessary as we need any changes we make to be seen by any reference to the versioned array. This is what I meant by inverting the changes. Inverting means reversing the order of the deltas as well as swapping the values in the deltas with the values in the array. The issue here is that we may have any given number of references to this data structure floating about. We need to modify the values that might be referenced so that if we use those reference later, they will still work.

So things might be coming into focus, although a versioned array is a list of deltas with it's last element an array, many other versioned arrays (or versions of the array) might be sharing some structure with this array. If they truly are versions of the same array, they at the very least share the last element in the list, which contains the array they are versioning. In the following figures we see the structure of a more complicated array as elements are accessed and modified (edited).

Accessing an element in a versioned array. This array has several versions. As demonstrated here, the version history is in general a tree. Upon access, the array is moved to the access point and all deltas are adjusted to maintain the array.

Accessing an element on a different branch of the version tree. Notice that no version is invalidated by this operation, not even the versions on the branch that isn't considered at all.
Editing an element in the array is much like an access except that at the end it creates another delta with the specified edit.

Multithreading

The very fact that we are messing around with mutation should clue you in that we are probably going to have thread safety issues. In fact, even reading is a mutating operation for this structure, so it isn't even safe to read from multiple threads at the same time. Due to this, everything inside the internals of this data structure needs to be wrapped in a lock unique to the structure. I chose to switch from a list to a structure so I could hold the lock in a straight-forward manner (and so I can have a type to specialize print-object on).

There is a single lock on the array. This means that, unlike tree based implementations of functional arrays, working with one version will halt any work with another version.


Performance

Let's see how this performs on the well known N-Queens problem. This involves backtracking, which is one of those use cases where we are only interested in array versions a constant number of deltas away from the current. If we ignore how simple the queens problem is and explicitly build a chess board to check the other attacking squares, we would get something like what follows. Here I have a version based on versioned arrays, and one sketched out for standard arrays. I also implemented this for a functional array based on FSet, which uses some kind of trees.

(defun place-queen (board i j)
  "This complicated bit is to test if a queen can attack another."
  (if (and (iter (for k from 0 below (va-dimension board 0))
                 (never (and (/= i k) (varef board k j))) )
           (iter (for k from 0 below (va-dimension board 1))
                 (never (and (/= j k) (varef board i k))) )
           (iter (for k1 from i downto 0)
                 (for k2 from j downto 0)
                 (never (and (/= k1 i) (/= k2 j) (varef board k1 k2))) )
           (iter (for k1 from i below (va-dimension board 0))
                 (for k2 from j below (va-dimension board 1))
                 (never (and (/= k1 i) (/= k2 j) (varef board k1 k2))) )
           (iter (for k1 from i downto 0)
                 (for k2 from j below (va-dimension board 1))
                 (never (and (/= k1 i) (/= k2 j) (varef board k1 k2))) )
           (iter (for k1 from i below (va-dimension board 0))
                 (for k2 from j downto 0)
                 (never (and (/= k1 i) (/= k2 j) (varef board k1 k2))) ))
      (modf (varef board i j) t)
      nil ))

(defun n-queens (board n row)
  "Simple enough backtracking algorithm."
  (if (= row n)
      board
      (iter (for i below n)
            (let ((result (place-queen board i row)))
              (when result
                (thereis (n-queens result n (+ row 1))) )))))

(defun place-queen* (board i j)
  "This complicated bit is to test if a queen can attack another."
  (if (and ...similar-tests...)
      t
      nil ))

(defun n-queens* (board n row)
  "A bit more convoluted backtracking algorithm."
  (if (= row n)
      board
      (let (winning-board)
        (iter (for i below n)
              (let ((result (place-queen* board i row)))
                (when result
                  (setf (aref board i row) t)
                  (let ((try (n-queens* board n (+ row 1))))
                    (when try
                      (setf winning-board try)
                      (finish) )
                  (setf (aref board i row) nil) ))
              (finally (return winning-board)) )))))

Comparing the two, we naturally see that the functional form is more elegant and understandable. The speed is another matter though…

$N$ArrayVersioned ArrayFSet ArrayVA/ArrayFSet/Array
1114444
12224411220.5
13112121212
14172552771516.294118
151420723114.78571416.5
161161726193714.87931016.698276
17681006116714.79411817.161765
1859587751007314.74789916.929412
194058469414.617.35

The functional version is easier to understand, of course, though using something like screamer would hide most if not all of the side effect annoyances. Speed wise, though, the functional approaches seem to be around a factor of 15 slower. The FSet method is consistently slower, but not by enough to say definitively that versioned arrays are going to be faster for this problem. We do see a slow growth in the FSet implementation. We are probably seeing the $O(\log N)$ access time. There might be a completely different story to tell with proper optimization of the versioned arrays. Particularly if someone tried to get it closer to the metal, so to speak.


Conclusions

A little side note: since starting the post, I have found out that Haskell uses something similar to this in its Diff Array library with the notable exception that the version history is forced to be linear. This is done by forcing a copy if a version other than the most recent is changed. This seems like it could lead to mysterious poor performance for small changes in your algorithm and certainly doesn't seem appropriate for backtracking. Fortunately, the person who let me know about data structure in the first place is currently implementing a library for this for Haskell.

One thing to note is when this doesn't work well. It doesn't work well when you have references to versions of an array with very different deltas. For instance, if you saved some initial state of an array, made many changes, and used both arrays by, say, comparing them element by element, things could get pretty slow. To curb this, several people have developed schemes to split the versioning tree if too many deltas are in between two particular arrays. Typically this is done probabilistically, randomly splitting the delta tree with probability $\frac 1 N$. With this modification, we have an expected amortized edit time of $O(1)$ (worst case $O(N)$ which is a steep price) and an expected worst case access time to be $O(N)$. The true worst case access time is $O(m)$ for arbitrarily bad luck on the RNG, but the expected $O(N)$ is a much needed improvement over knowing it is $O(m)$ where $m$ can be arbitrarily large. Since all of the work is done in walking the deltas and rebasing the array, edits and accesses have the same time complexity.

This all sounds like gloom and doom, but really, there are a large class of problems that don't require access from very different arrays. Further, much of this trouble can be resolved by breaking the abstraction and allowing the user to explicitly force a copy.

I was thinking about it, and there is nothing here limiting you to arrays. I could, in principle, implement this kind of versioning for any mutable structure. Instead of storing indices, the arguments of the setter function, store the function and the arguments and the setter can be inverted the same way it is in Modf. It doesn't sound like too bad of an idea to have a generic functionality to stick any object into and get back an immutable object with reasonable asymptotic performance. Maybe this should be a future extension for Modf.

No comments :

Post a Comment