Ruby HTML and XML Parsers
Extracting data from the web, a practice known as web scraping, typically requires reading and processing content from HTML and XML documents. Parsers are the software tools that turn this markup into a structure your code can search and extract data from.
The Ruby developer community offers many capable HTML and XML parsers, so there are a lot of options to choose from. In choosing which to go with, you might consider the following criteria:
To help you decide on the right parser for your project, this article will analyze six HTML and XML parsing libraries in Ruby based on the above-mentioned criteria.
Nokogiri is an open source Ruby HTML and XML parsing library. Under the hood, it wraps native parsers such as libxml2 (on CRuby) and Xerces (on JRuby) to deliver its core functionality. Besides providing an API that follows a strong security policy, it offers a DOM parser, a SAX parser, a push parser, and XSLT transformation.
Nokogiri is also the default HTML parser for Mechanize, a library for automating interaction with websites.
Here’s an example of using Nokogiri:
require 'nokogiri'
doc = "<html><head><title>page title</title></head><body>page content</body></html>"
# Parse the HTML string into a document object
parsed_data = Nokogiri::HTML.parse(doc)
# Print the text of the <title> element
puts parsed_data.title
# => page title
Hpricot is another open source Ruby HTML parsing library. It was originally designed to parse HTML documents, though it's technically capable of parsing XML documents as well. It aims to locate elements in HTML documents precisely, so content you can see in an HTML document in your browser can usually be reached through Hpricot. Keep in mind that Hpricot is no longer actively maintained, so check that it installs on your Ruby version before relying on it.
Here’s an example of using Hpricot:
require 'hpricot'
doc = "<html><head><title>page title</title></head><body>page content</body></html>"
# Find the first <title> element and read its inner HTML
parsed_data = Hpricot(doc).at("title").inner_html
puts parsed_data
# => page title
Optimized XML (Ox), as its name suggests, is a fast and efficient Ruby XML parsing library with HTML handling capabilities. It is open source and provides a clean and simple API. Ox processes XML documents in three ways: as a generic XML parser and writer, as a fast Object / XML marshaller, and as a stream SAX parser. It is intended as an alternative to Nokogiri and other Ruby XML parsers for XML parsing, and to Marshal for object serialization.
Here’s an example of using Ox:
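require 'ox'
# The XML declaration must be the first thing in the document
parsed_doc = Ox.parse(%{<?xml version="1.0"?>
<Person>
<Name>Mary Active</Name>
<Age>21</Age>
</Person>
})
# Child elements can be reached by name; text returns the element's text
parsed_data = parsed_doc.Person.Name.text
puts parsed_data
# => Mary Active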
Oga is another open source HTML and XML parser written in Ruby. It uses a small, native extension (C for MRI/Rubinius; Java for JRuby) to achieve better performance. It provides a simple API to parse, modify, and query documents via XPath, and it comes with support for XML namespaces (registering, querying, etc.).
Here’s an example of using Oga:
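require 'oga'
# The XML declaration must be the first thing in the document
parsed_doc = Oga.parse_xml(%{<?xml version="1.0"?>
<Person>
<Name>Mary Active</Name>
<Age>21</Age>
</Person>
})
# at_xpath returns the first matching node; text reads its text content
parsed_data = parsed_doc.at_xpath('Person/Name').text
puts parsed_data
# => Mary Active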
LibXML Ruby offers a Ruby wrapper around the GNOME libxml2 XML library. It's open source and provides a broad set of features, such as XML validation, XPath support, XSLT support, and more.
Here’s an example of using LibXML Ruby:
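require 'xml'
# XML::Parser.string builds a parser for the given XML string
parser = XML::Parser.string(%{
<Person>
<Name>Mary Active</Name>
<Age>21</Age>
</Person>
})
# parse returns the document; find runs an XPath query against it
parsed_data = parser.parse.find('//Name').first.content
puts parsed_data
# => Mary Active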
Selenium WebDriver is an open source browser automation tool that offers Ruby language bindings. It's much more than an HTML parsing library: it comes with multi-browser support, handling of dynamic web elements, a wide range of locating strategies, mouse and keyboard events, and more. This makes it a popular choice among quality assurance engineers for validating behavior when testing web applications.
Here’s an example of using Selenium WebDriver:
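require 'selenium-webdriver'
# Only needed with Selenium releases older than 4.11; newer releases locate the driver themselves via Selenium Manager
require 'webdrivers/chromedriver'
# Launch a Chrome session (requires Chrome to be installed)
driver = Selenium::WebDriver.for :chrome
driver.get('https://www.selenium.dev/selenium/web/web-form.html')
puts driver.title
# => Web form
# Close the browser when done
driver.quit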
Web scraping is very useful for extracting relevant, structured data from HTML and XML documents in an automated fashion, especially when you don't have access to a public API that provides this data. The Ruby developer community provides several libraries that make it easier to parse these HTML and XML documents. This article discussed six of them to give you a head start in selecting the right parser for your individual use case.
Each parsing library comes with a unique set of features that you may find worth trying, depending on your priorities. If you're looking for a well-rounded HTML and XML parser, consider using Nokogiri. It's one of the most popular choices and has tons of resources to help you if you get stuck. If you don’t want to use Nokogiri for HTML parsing, then you can consider Hpricot to achieve a similar level of precision. If speed is your top priority, give Ox a try. On the other hand, LibXML Ruby or Oga can be an excellent choice if you’re looking for a memory-efficient library. And finally, if you need to simulate browser actions to parse a document, Selenium WebDriver is your best bet.