Ruby Extension: HTML Truncation

Another tip (err hurdle) I came across during the production of this blog — truncating an HTML string. Easy, right?

It seems simple enough: show a short excerpt of a long entry. Excerpts are extremely popular in blogs, catalogs, portfolios, etc., and with good reason: the average visitor wants to find content by skimming, not by scrolling through everything.

But a good trimmer must keep a few things in mind.

  • Don’t split words
  • Recognize/respect HTML tags
  • Parse HTML according to standards

These add up to some pretty tricky requirements once you actually get to coding. First, unless we want to parse HTML by hand, we’ll have to use a standards-based parser and loop through all the elements until the specified number of characters/words (excluding tags!) is exceeded. At that point we append a user-defined tail, close any tags that are still open, and discard everything after.
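To make the bookkeeping concrete, here is a toy sketch of that loop. It deliberately cheats: a naive regex tokenizer stands in for a real parser (exactly the hand-parsing I said to avoid), so treat it as an illustration of the counting and tag-closing logic only. The method name sketch_truncate_html and its arguments are made up for this example.

```ruby
# Toy illustration only: a regex tokenizer stands in for a real HTML parser.
# It does not handle void tags (<br>), comments, entities, or ">" in attributes.
def sketch_truncate_html(html, max_chars, tail = '...')
  open_tags = []         # stack of tag names that are currently open
  remaining = max_chars  # visible (non-tag) characters we may still emit
  out = +''

  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?('</')
      open_tags.pop
      out << token
    elsif token.start_with?('<')
      open_tags.push(token[/\A<\s*([A-Za-z0-9]+)/, 1])
      out << token
    elsif token.length <= remaining
      out << token
      remaining -= token.length
    else
      # Over budget: back up to the last space so no word is split.
      piece = token[0, remaining]
      piece = piece[0, piece.rindex(' ') || 0] unless token[remaining] == ' '
      out << piece.rstrip << tail
      break
    end
  end

  # Close whatever was still open when we stopped.
  open_tags.reverse_each { |tag| out << "</#{tag}>" }
  out
end

puts sketch_truncate_html('<p>Hello <em>brave new</em> world</p>', 12)
# prints: <p>Hello <em>brave...</em></p>
```

Note that tags never count against the budget, and the tail goes inside the innermost open tag so the result stays balanced.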

Update: the latest version of this handy widget is now available as a gem! Check it out at rubygems.org/gems/butter or install it with gem install butter.

First attempt

Solution 1 came from a blog, and was then heavily modified to make it work with a more modern interface.

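require 'rexml/parsers/pullparser'
require 'htmlentities'
# String#first and #html_safe? come from ActiveSupport (already loaded in Rails)
require 'active_support/core_ext/string'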

class String
  # Truncate strings containing HTML code
  # Usage example: "string".truncate_html(50, :word_cut => false, :tail => '[+]')
  def truncate_html(len = 30, opts = {})
    opts = {:word_cut => true, :tail => '…'}.merge(opts)
    p = REXML::Parsers::PullParser.new(self)
    coder = HTMLEntities.new
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last} #{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        text = coder.decode(p_e[0])
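        if (text.length > new_len) and !opts[:word_cut]
          # cut at the next space; String#index returns nil when no space
          # follows new_len, in which case we keep the whole text node
          piece = text.first(text.index(' ', new_len) || text.length)
        else
          piece = text.first(new_len)
        end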
        results << coder.encode(piece)
        new_len -= text.length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results << opts[:tail]

    if html_safe? then
      results.html_safe
    else
      results
    end
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end

This worked great…at first. All went well in development, but once I launched into production and started using slightly more complex HTML, it completely crashed the index pages. What was the cause? An <em> tag. Why? I have no clue, but I know where the problem occurred: the REXML parser. On top of being an outdated parser (I suppose that includes choking on em tags), it is one of the slowest parsers out there. So the nice little script I had found online was just about useless, and frankly it seemed way too complex and inelegant for our modern gem-based apps.

Final attempt

Thankfully though, with some more searching I found an elegant solution using Nokogiri (the best parser gem by far!) and some creative word boundary logic. The bulk of this code (and comments) was designed by Eleo, but I modified the interface a bit (instance method instead of class method) and added a few other tweaks.

require 'nokogiri'
require 'htmlentities'

class String
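  # Truncate a string containing HTML to num_words words, keeping tags balanced
  # Usage example: "<p>string</p>".truncate_html(50, :tail => '[+]')
  # Note: :word_cut is accepted but not used by this version
  def truncate_html(num_words = 30, opts = {})
    opts = {:word_cut => true, :tail => "&hellip;"}.merge(opts)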
    tail = HTMLEntities.new.decode(opts[:tail])

    doc = Nokogiri::HTML(self)

    current = doc.children.first
    count = 0

    while true
      # we found a text node
      if current.is_a?(Nokogiri::XML::Text)
        count += current.text.split.length
        # we reached our limit, let's get outta here!
        break if count > num_words
        previous = current
      end

      if current.children.length > 0
        # this node has children, can't be a text node,
        # let's descend and look for text nodes
        current = current.children.first
      elsif !current.next.nil?
        # this has no children, but has a sibling, let's check it out
        current = current.next
      else
        # we are the last child, we need to ascend until we are
        # either done or find a sibling to continue on to
        n = current
        while !n.is_a?(Nokogiri::HTML::Document) and n.parent.next.nil?
          n = n.parent
        end

        # we've reached the top and found no more text nodes, break
        if n.is_a?(Nokogiri::HTML::Document)
          break
        else
          current = n.parent.next
        end
      end
    end

    if count >= num_words
      unless count == num_words
        new_content = current.text.split

        # If we're here, the last text node we counted eclipsed the number of words
        # that we want, so we need to cut down on words.  The easiest way to think about
        # this is that without this node we'd have fewer words than the limit, so all
        # the previous words plus a limited number of words from this node are needed.
        # We simply need to figure out how many words are needed and grab that many.
        # Then we need to subtract one, because the first word would be index zero.

        # For example, given:
        # <p>Testing this HTML truncater.</p><p>To see if its working.</p>
        # Let's say I want 6 words.  The correct returned string would be:
        # <p>Testing this HTML truncater.</p><p>To see...</p>
        # All the words in both paragraphs = 9
        # The last paragraph is the one that breaks the limit.  How many words would we
        # have without it? 4.  But we want up to 6, so we might as well get that many.
        # 6 - 4 = 2, so we get 2 words from this node, but words #1-2 are indices #0-1, so
        # we subtract 1.  If this gives us -1, we want nothing from this node. So go back to
        # the previous node instead.
        index = num_words - (count - new_content.length) - 1
        if index >= 0
          new_content = new_content[0..index]
          current.content = new_content.join(' ') + tail
        else
          current = previous
          current.content = current.content + tail
        end
      end

      # remove everything else
      while !current.is_a?(Nokogiri::HTML::Document)
        while !current.next.nil?
          current.next.remove
        end
        current = current.parent
      end
    end

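    # now we grab the html and not the text.
    # nokogiri wraps the input in html and body tags, which we don't
    # want, so we return everything inside body (not just its first child,
    # which would drop any sibling elements such as a second paragraph)
    truncated = doc.at_css('body').inner_html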

    if html_safe?
      truncated.html_safe
    else
      truncated
    end
  end
end

I’ve been really happy with the success of this particular implementation. In addition to using Nokogiri, the code is easier to understand and takes a word count rather than a character count as its limit.
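The index arithmetic from the long comment in the code is the part that took me longest to trust, so here it is pulled out into a tiny standalone helper. The name words_to_keep is hypothetical and not part of the method above; count is the running word total after the offending text node has been added, just like in the loop.

```ruby
# Hypothetical helper: which words to keep from the text node that pushed
# the running total (count) past the limit (num_words).
def words_to_keep(node_text, count, num_words)
  words = node_text.split
  # Words counted before this node: count - words.length.
  # Words still allowed from this node: num_words minus that.
  # Subtract 1 to turn "how many words" into "last index".
  index = num_words - (count - words.length) - 1
  index >= 0 ? words[0..index] : []
end

# The example from the comments: 4 words in the first paragraph, 5 in the
# second (count = 9), and a limit of 6 words.
p words_to_keep('To see if its working.', 9, 6)
# prints: ["To", "see"]
```

An empty array corresponds to the index of -1 case, where the real method falls back to the previous text node and appends the tail there.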

Future additions

One of the tweaks I made was the options hash, as I foresee eventually adding more options to the truncate operation, such as using a character count as the length, the ability to split before the word boundary, etc. For the moment, however, this approach has worked really well. Remember to add this code to string.rb in your lib directory, and include it with auto_require.
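As a rough idea of where the options hash could go, here is a sketch that works on plain text only (no HTML handling at all) and switches between word and character limits. The :unit option, the TRUNCATE_DEFAULTS constant, and the sketch_truncate name are all hypothetical; nothing like them exists in the method above yet.

```ruby
# Sketch only: plain text, no HTML. The :unit option is hypothetical.
TRUNCATE_DEFAULTS = { :unit => :words, :tail => '...' }.freeze

def sketch_truncate(text, length = 30, opts = {})
  # Caller-supplied options win over the defaults.
  opts = TRUNCATE_DEFAULTS.merge(opts)

  case opts[:unit]
  when :words
    words = text.split
    return text if words.length <= length
    words.first(length).join(' ') + opts[:tail]
  when :chars
    return text if text.length <= length
    text[0, length].rstrip + opts[:tail]
  else
    raise ArgumentError, "unknown unit: #{opts[:unit].inspect}"
  end
end

puts sketch_truncate('one two three four', 2)
# prints: one two...
puts sketch_truncate('one two three four', 5, :unit => :chars)
# prints: one t...
```

Merging into a defaults hash means new options can be added later without changing the method signature or breaking existing callers.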

As always, if you guys have any other options you think might be useful for the truncate_html method, or alternate solutions for that matter, mention them in the comments!