Ruby HTML and XML Parsers

Extracting data from the web, known as web scraping, typically requires reading and processing content from HTML and XML documents. Parsers make this possible by turning raw markup into a structure that your code can search and traverse.

The Ruby developer community offers a number of capable HTML and XML parsers for web scraping. When choosing among them, consider the following criteria:

It must be open source.
It must have good community support.
It must have good performance.
It must be actively maintained.
It must be lightweight.
It must have good documentation.

To help you decide on the right parser for your project, this article will analyze six HTML and XML parsing libraries in Ruby based on the above-mentioned criteria.

Nokogiri

Nokogiri is an open source Ruby HTML and XML parsing library. Under the hood, it relies on native parsers to deliver its core functionality: libxml2 on CRuby and Xerces on JRuby. Its API is secure by default, treating every document as untrusted, and it offers features such as a DOM parser, SAX parser, push parser, and XSLT transformation.

Nokogiri is the default HTML parser for Mechanize, a library for automating interaction with websites.

Here’s an example of using Nokogiri:

require 'nokogiri'

doc = "<html><head><title>page title</title></head><body>page content</body></html>"
parsed_data = Nokogiri::HTML.parse(doc)

puts parsed_data.title
# prints: page title

Pros

It's the most popular and widely used Ruby library for HTML and XML parsing. You'll find many discussions in question-and-answer forums (e.g. Stack Overflow), and you can easily find blog posts and tutorials online to help you get started.
It's fast and efficient due to leveraging the power of the native parsers, libxml2 and xerces.
It has frequent releases that address bug fixes and introduce updates and enhancements to the library.
It has a robust API that copes well with poorly formatted HTML, which minimizes flakiness.
It lets you search HTML and XML documents with XPath and CSS selectors.
It has comprehensive documentation, so you won't have trouble figuring out a particular functionality.

Cons

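The primary limitation of Nokogiri is that it only parses the markup you give it. It doesn't execute JavaScript, so on its own it can't parse content that a page loads using Ajax.
It also doesn't fetch pages or manage sessions, so you may face difficulties if the website requires authentication, such as a username and a password, unless you pair it with an HTTP client or a library like Mechanize.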

Hpricot

Hpricot is another open source Ruby HTML parsing library. It was originally designed to parse HTML documents, though it's technically capable of parsing XML documents as well. It locates elements in HTML documents precisely, so if you can see some content in an HTML document in your browser, Hpricot is sure to parse it.

Here’s an example of using Hpricot:

require 'hpricot'

doc = "<html><head><title>page title</title></head><body>page content</body></html>"
parsed_data = Hpricot(doc).at("title").inner_html

puts parsed_data
# prints: page title

Pros

It's very fast, since the majority of its performance-related functionalities are written in C.
It requires only a Ruby installation, with no other dependencies, which makes it very lightweight.
It has a robust API that enables reading and parsing malformed HTML easily, just like Nokogiri.
It makes locating HTML elements simpler by utilizing XPath and CSS selectors.
It has good documentation to help you navigate the library.

Cons

It's no longer maintained and it doesn't have strong community support, so it's highly likely that you'll encounter issues while working with this library.
Technically, Hpricot isn’t designed for XML parsing. It doesn’t validate XML documents, so you may encounter problems with malformed XML.

Ox

Optimized XML (Ox), as its name suggests, is a fast and efficient Ruby XML parsing library with HTML handling capabilities. It is open source and provides a clean and simple API. Ox processes XML documents in three ways: as a generic XML parser and writer, as a fast Object / XML marshaller, and as a stream SAX parser. It aims to replace Nokogiri and other Ruby XML parsers for XML parsing, and Marshal for object serialization.

Here’s an example of using Ox:

require 'ox'

parsed_doc = Ox.parse(%{
<?xml?>
<Person>
    <Name>Mary Active</Name>
    <Age>21</Age>
</Person>
})
parsed_data = parsed_doc.Person.Name.text

puts parsed_data
# prints: Mary Active

Pros

It's significantly faster than Nokogiri and LibXML Ruby for XML parsing.
It's lightweight, since it requires no other dependencies. Also, version mismatches with libxml aren't an issue.
As one of the most popular Ruby libraries for XML parsing (second only to Nokogiri), it has great community support.
It provides good documentation for the available APIs.
It has several releases and is actively maintained with updates and enhancements.

Cons

It doesn't support XPath, full namespace handling, XSLT transformation, and similar features.

Oga

Oga is another open source HTML and XML parser written in Ruby. It uses a small, native extension (C for MRI/Rubinius; Java for JRuby) to achieve better performance. It provides a simple API to parse, modify, and query documents via XPath, and it comes with support for XML namespaces (registering, querying, etc.).

Here’s an example of using Oga:

require 'oga'

parsed_doc = Oga.parse_xml(%{
<?xml?>
<Person>
    <Name>Mary Active</Name>
    <Age>21</Age>
</Person>
})
parsed_data = parsed_doc.at_xpath('Person/Name/text()')

puts parsed_data
# prints: Mary Active

Pros

It's easy to install on various platforms, since it doesn't require system libraries like libxml.
Its API enables users to safely parse and query documents in a multi-threaded environment.
It has a low memory footprint.
It comes with good documentation to help you learn and use the library.

Cons

It has very few recent releases and it isn't actively maintained, so it rarely receives updates and enhancements.

LibXML Ruby

LibXML Ruby offers a Ruby wrapper around the GNOME libxml2 XML library. It's open source and provides a broad set of features, such as XML validation, XPath support, XSLT support, and more.

Here’s an example of using LibXML Ruby:

require 'xml'

parsed_doc = XML::Parser.string(%{
<Person>
    <Name>Mary Active</Name>
    <Age>21</Age>
</Person>
})
parsed_data = parsed_doc.parse.find('//Name').first.content

puts parsed_data
# prints: Mary Active

Pros

It's very fast since it's written in C.
It's actively maintained with updates and enhancements, and has frequent releases.
It has good documentation.
It manages memory automatically.

Cons

It relies on libxml2 to function properly, which in turn relies on libm, libz, and libiconv.
It's not a good choice for HTML parsing: its HTML parser targets HTML4, whereas most modern browsers implement HTML5, whose parsing rules differ.

Selenium WebDriver

Selenium WebDriver is an open source browser automation tool that offers Ruby language bindings. It’s much more than an HTML parsing library—it comes with multi-browser support, handling of dynamic web elements, a wide range of locating strategies, mouse and keyboard events, and more. This makes it a popular choice among quality assurance engineers to perform test validations in web application testing.

Here’s an example of using Selenium WebDriver:

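require 'selenium-webdriver'
# The webdrivers gem supplies chromedriver on older Selenium releases;
# recent releases can resolve the driver themselves, so this line may be unnecessary.
require 'webdrivers/chromedriver'

driver = Selenium::WebDriver.for :chrome
driver.get('https://www.selenium.dev/selenium/web/web-form.html')

puts driver.title
# prints: Web form

driver.quit # close the browser session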

Pros

It's one of the most popular and widely used quality assurance tools, so it has very active community support.
It’s supported by numerous how-to tutorials, blog posts, videos, and books, so you won't have any shortage of resources to learn and utilize this tool.
It provides very good documentation with examples.
It's actively maintained with updates and enhancements.
It supports eight traditional web element locating strategies: class name, CSS selector, id, name, link text, partial link text, tag name, and XPath.

Cons

As a general-purpose browser automation library that is mostly used for quality assurance, using Selenium WebDriver for HTML parsing may be overkill.
You have to install browser drivers to carry out its functionalities.
It's the slowest library on this list, since it needs to interact with the browser via the browser driver.
Selenium WebDriver doesn't wait for elements automatically. Unless you configure implicit or explicit waits yourself, scripts may become unstable on pages that load content dynamically.

Conclusion

Web scraping is very useful for extracting relevant, structured data from HTML and XML documents in an automated fashion, especially when you don't have access to a public API that provides this data. The Ruby developer community provides several libraries that make it easier to parse these HTML and XML documents. This article discussed six of them to give you a head start in selecting the right parser for your individual use case.

Each parsing library comes with a unique set of features that you may find worth trying, depending on your priorities. If you're looking for a well-rounded HTML and XML parser, consider using Nokogiri. It's one of the most popular choices and has tons of resources to help you if you get stuck. If you don’t want to use Nokogiri for HTML parsing, Hpricot offers a similar level of precision, though keep in mind that it's no longer maintained. If speed is your top priority, give Ox a try. On the other hand, LibXML Ruby or Oga can be an excellent choice if you’re looking for a memory-efficient library. And finally, if you need to simulate browser actions to parse a document, Selenium WebDriver is your best bet.

This blog post was originally published at: https://www.scrapingbee.com/blog/
