Algorithmic std::string creation

Suppose we want to create a C++ std::string object, and we know how long the resulting string will be, and we know how we'll generate the content of the string, but we do not have a copy of said content in memory. As a concrete example, consider taking a block of memory, and returning a string whose content is the hexadecimal representation of that memory block. The result should satisfy the following tests:

std::string hexify(const char*, size_t);

assertEqual("baadf00d", hexify("\xBA\xAD\xF0\x0D", 4));
assertEqual("baad"    , hexify("\xBA\xAD\xF0\x0D", 2));
A naïve approach might be something along the following lines:
#include <sstream>
#include <iomanip>

std::string hexifyChar(int c)
{
  std::stringstream ss;
  ss << std::hex << std::setw(2) << std::setfill('0') << c;
  return ss.str();
}

std::string hexify(const char* base, size_t len)
{
  std::stringstream ss;
  for(size_t i = 0; i < len; ++i)
    ss << hexifyChar(base[i]);
  return ss.str();
}
This is fairly readable code, and if speed wasn't a concern, could well be a viable solution. On the other hand, if speed is desired, then this code isn't particularly good, at least not without a clairvoyant optimising compiler. Within both functions, the rather heavyweight std::stringstream tool is used, which may be slow due to be extremely flexible, and also due to not knowing how long its result will be. Given that we do not need most of the flexibility of std::stringstream, and we know how long the result will be (2 for hexifyChar, and 2 * len for hexify), we should expect to be able to do better.

As we want to construct a std::string, let us look at its constructors, and see which ones might work for us:

  1. Default constructor: string ();
  2. Copy constructor: string (const string& str);
  3. Substring constructor: string (const string& str, size_t pos, size_t n = npos);
  4. Existing content constructor: string (const char * s, size_t n);
  5. C string constructor: string (const char * s);
  6. Repetition constructor: string (size_t n, char c);
  7. Iterator constructor: template<class Iterator> string (Iterator begin, Iterator end);

None of these look particularly like what we want, but the last option might be viable if we can wrap a hexifying algorithm inside an iterator. If said iterator is infact a random access iterator, then the constructor will be able to obtain the length of the result before it starts fetching the individual characters.

Writing an STL-compatible iterator usually takes a lot of code, but we can use boost::iterator_facade to do most of the work for us, giving us the following code:

struct HexifyIterator
  : boost::iterator_facade<HexifyIterator, const char,
                           boost::random_access_traversal_tag>
{
  HexifyIterator()
    : ptr_()
    , nibble_() {}
  HexifyIterator(pointer ptr)
    : ptr_(reinterpret_cast<const unsigned char*>(ptr))
    , nibble_(1) {}
  HexifyIterator(const HexifyIterator& other)
    : iterator_facade(other)
    , ptr_(other.ptr_)
    , nibble_(other.nibble_) {}

private:
  friend class boost::iterator_core_access;

  const unsigned char* ptr_;
  difference_type nibble_;

  void increment() { advance(1); }
  void decrement() { advance(-1); }
  void advance(difference_type n)
  {
    nibble_ -= n;
    ptr_ -= nibble_ / 2;
    nibble_ %= 2;
    if(nibble_ < 0)
    {
      ++ptr_;
      nibble_ += 2;
    }
  }

  difference_type distance_to(const HexifyIterator& other) const
  { return (other.ptr_ - ptr_) * 2 + (nibble_ - other.nibble_); }

  bool equal(const HexifyIterator& other) const
  { return ptr_ == other.ptr_ && nibble_ == other.nibble_; }

  reference dereference() const
  { return "0123456789abcdef"[(*ptr_ >> (4 * nibble_)) & 0xF]; }
};

std::string hexify(const char* base, size_t len)
{
  return std::string(HexifyIterator(base), HexifyIterator(base + len));
}
There is still rather more code here than I would like, but in terms of speed, my quick and crude test put it at 72 times faster than the naïve approach.

Comments

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <span>
  • Lines and paragraphs break automatically.
  • You can enable syntax highlighting of source code with the following tags: <code>, <blockcode>, <brainfuck>, <c>, <cpp>, <haskell>, <lua>, <php>.

More information about formatting options