Thursday, May 23, 2013

Quickly Searching For a String in a Buffer

This is a bit embarrassing to admit, but I had never seen this solution until I recently saw the problem in a Coursera lecture.

Given a buffer sequence of length \(N\) and a query sequence of length \(m\), find all occurrences of the query in the buffer.

The most transparent way to solve this problem is to check the query against each starting position in the buffer. This is clearly \(O(Nm)\) (strictly speaking it is about \(O((N-m)m)\), but for small \(m\) we usually treat \(m\) as a constant, call it \(O(N)\), and leave it at that).

(defun get-string (i length buffer)
  "Extract the string without any copying or memory use."
  (make-array length
              :element-type 'character
              :displaced-to buffer
              :displaced-index-offset i))

;; The iter looping forms used throughout this post come from the iterate library.
(ql:quickload :iterate)
(use-package :iterate)

(defun naive-search (query buffer)
  "Find all occurrences of query in buffer."
  (iter (for i :to (- (length buffer) (length query)))
    (when (equal query (get-string i (length query) buffer))
      (collect i))))

This is all pretty silly, though, as Common Lisp has a built-in sequence search called search. In addition, there are third-party libraries that can do string searches and much more:

;; Can't get much simpler than...
(defun search-search (query buffer)
  "Find all occurrences of query in buffer."
  (iter
    (for i :initially (search query buffer :start2 0)
           :then (search query buffer :start2 (1+ i)))
    (while i)
    (collect i)))

;; If you are using strings, regular expressions are pretty versatile.
(ql:quickload :cl-ppcre)
(defun regex-search (query buffer)
  "Find all occurrences of query in buffer."
  (let (acc)
    (ppcre:do-matches (start end
                       query buffer
                       (reverse acc))
      (push start acc))))
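
As a quick sanity check, all three agree on a simple example ("banana" here is just my own toy input). Note that regex-search treats the query as a regular expression, so special characters would need escaping, and do-matches finds successive non-overlapping matches, so it can report fewer hits than the other two for a self-overlapping query like "aa":

(naive-search "an" "banana")   ; => (1 3)
(search-search "an" "banana")  ; => (1 3)
(regex-search "an" "banana")   ; => (1 3)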

The interesting thing about these is that they are probably much smarter than my naive implementation. Besides being highly tuned, they most likely reuse information from previous failed matches in subsequent comparisons, which means that they might run in \(O(N)\) time (i.e. no \(m\) factor).
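
For instance, the classic Knuth-Morris-Pratt algorithm precomputes, for every prefix of the query, how far a failed match can safely fall back, so the buffer is scanned only once. I don't know what any particular implementation of search actually does, but here is a minimal sketch of that idea (kmp-failure-table and kmp-search are my own names, not library functions):

(defun kmp-failure-table (query)
  "TABLE[i] holds the length of the longest proper prefix of QUERY that is
also a suffix of the first i+1 characters of QUERY."
  (let ((table (make-array (length query) :initial-element 0))
        (k 0))
    (iter (for i :from 1 :below (length query))
      ;; Fall back until the matched prefix can be extended (or is empty).
      (iter (while (and (> k 0) (char/= (char query i) (char query k))))
        (setf k (aref table (1- k))))
      (when (char= (char query i) (char query k))
        (incf k))
      (setf (aref table i) k))
    table))

(defun kmp-search (query buffer)
  "Find all occurrences of query in buffer in O(N + m) time."
  (let ((table (kmp-failure-table query))
        (k 0))
    (iter (for i :from 0 :below (length buffer))
      ;; On a mismatch, reuse the failed match via the table instead of
      ;; restarting the comparison from scratch.
      (iter (while (and (> k 0) (char/= (char buffer i) (char query k))))
        (setf k (aref table (1- k))))
      (when (char= (char buffer i) (char query k))
        (incf k))
      (when (= k (length query))
        (collect (- i (1- (length query))))
        ;; Keep going so that overlapping matches are found too.
        (setf k (aref table (1- k)))))))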

The point is that, either way, simply scanning the text for the query in the most naive way possible takes \(O(N)\) time. In the lecture, however, they stated that you could sort the data and then perform a bisection search. I was taken aback, as it seemed to me that you cannot sort a body of text without destroying its order, and the order is exactly what matters when you are matching a sequence of characters inside a larger sequence.

The way we can "sort the sequence" while keeping the sequence itself in order is to sort a separate buffer that holds indexes into the sequence. Sorting takes \(O(N\log N)\), so a single search actually takes longer to compute this way, but once you have the sorted index you can perform subsequent searches (with queries of the same length or shorter) in \(O(\log N)\).

To accommodate that usage pattern, I have included an ad-hoc, manual, partial memoization mechanism: I return the computed sorted indexes (along with the maximum length of a valid query) as a second return value and accept them back as an optional argument (I use a class here so that the printed value doesn't flood your terminal). If you pass invalid cache parameters, the function detects this and recomputes.
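
To make the idea concrete before the full code, here is a tiny made-up example using the get-string helper from above:

;; Sort the starting positions of "banana" by their two-character prefixes.
(let ((buffer "banana")
      (m 2))
  (stable-sort (iter (for i :to (- (length buffer) m))
                 (collect i :result-type vector))
               #'string<
               :key (lambda (i) (get-string i m buffer))))
;; => #(1 3 0 2 4), whose prefixes read "an" "an" "ba" "na" "na".
;; Equal prefixes land next to each other, so a bisection search over this
;; vector finds every starting position of a length-2 (or shorter) query.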

;; Define a container for our cache
(defclass cache ()
  ((max-length :initarg :max-length :accessor max-length)
   (buffer :initarg :buffer :accessor buffer)
   (indexes :initarg :indexes :accessor indexes)))

(defun indexed-search (query buffer &optional cache)
  "Find all occurrences of query in buffer."
  ;; Extract (or compute) the cache
  (destructuring-bind (max-query-length sorted-indexes)
      (if (and cache
               (eq buffer (buffer cache))
               (<= (length query) (max-length cache)))
          (list (max-length cache) (indexes cache))
          (list (length query)
                (stable-sort (iter (for i :to (- (length buffer) (length query)))
                               (collect i :result-type vector))
                             #'string<
                             :key (lambda (i) (get-string i (length query) buffer)))))
    (labels ((bisection (lo hi)
               ;; A simple bisection search
               (let ((mid (floor (+ lo hi) 2)))
                 (cond ((= lo hi) lo) ; we have either found the first match or
                                      ; there are no matches
                       ((string<= query (get-string (aref sorted-indexes mid)
                                                    (length query) buffer))
                        (bisection lo (min mid (- hi 1))))
                       (t (bisection (max mid (+ lo 1)) hi))))))
      ;; Perform the search and return the index and cache
      (values (iter (for i :from (bisection 0 (length sorted-indexes))
                           :below (length sorted-indexes))
                (while (string= query (get-string (aref sorted-indexes i)
                                                  (length query) buffer)))
                (collect (aref sorted-indexes i)))
              (make-instance 'cache :max-length max-query-length
                                    :buffer buffer
                                    :indexes sorted-indexes)))))

And indeed, after the \(O(N\log N)\) preprocessing, this runs in \(O(\log N)\) time. In my timings, the initial sort takes a very long time (presumably due to the cost of the lexical string< comparisons), but the subsequent queries are basically instantaneous (too fast to measure, even for large buffer sequences). So, while you probably don't want to use this for ordinary searches, if you need to make many queries (of a limited length) against an immutable sequence, it could result in a huge performance increase.
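
For completeness, here is a sketch of the usage pattern I have in mind; the repeated "abracadabra " buffer is just made-up test data:

;; Build a biggish immutable buffer of made-up data.
(defparameter *buffer*
  (with-output-to-string (out)
    (dotimes (i 10000) (princ "abracadabra " out))))

;; The first call pays for the O(N log N) sort and hands back the cache...
(multiple-value-bind (matches cache) (indexed-search "cad" *buffer*)
  (print (length matches))                              ; => 10000
  ;; ...which later calls on the same buffer, with queries no longer than
  ;; the first one, can reuse for the fast bisection lookups.
  (print (length (indexed-search "ab" *buffer* cache)))) ; => 20000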

Well, I guess this just serves as a reminder that I still have a lot of basic stuff to learn. However, after writing this up, I doubt that I will forget it…
