R XML: How to Work With XML Files in R

By:

Dario Radečić

November 2, 2022

The R programming language can read all sorts of data, and <strong>XML</strong> is no exception. There are many ways to <strong>read</strong>, <strong>parse</strong>, and <strong>manipulate</strong> these <strong>markup language files in R</strong>, and today we'll explore two. By the end of the article, you'll know how to use two <strong>R packages to work with XML</strong>.

We'll kick things off with an R XML introduction - you'll get a sense of what XML is, and we'll also write an XML dataset from scratch. Then, you'll learn how to access individual elements, convert XML files to an R <code>tibble</code> and a <code>data.frame</code>, and much more.

<blockquote>Are you a complete beginner in R? <a href="https://appsilon.com/oop-in-r-with-r6/" target="_blank" rel="noopener">See how R handles Object-Oriented Programming (OOP) with R6</a>.</blockquote>

Table of contents:
<ul>
<li><a href="#introduction">Introduction to R XML</a></li>
<li><a href="#basics">R XML Basics - How to Read and Parse XML Files</a></li>
<li><a href="#dataframes">How to Convert XML Data to tibble and data.frame</a></li>
<li><a href="#summary">Summary of R XML</a></li>
</ul>

<hr />

<h2 id="introduction">Introduction to R XML</h2>

First, let's answer one important question: <b>What is XML?</b> The acronym stands for <i>Extensible Markup Language</i>. It's similar to HTML since they're both markup languages, but XML is used for storing and transmitting data rather than for displaying it. XML files typically have an <code>.xml</code> file extension.

<blockquote>Building an interactive map with R and Shiny? See if you should be <a href="https://appsilon.com/leaflet-vs-tmap-build-interactive-maps-with-r-shiny/" target="_blank" rel="noopener">using Leaflet vs Tmap</a>.</blockquote>

When you first start working with XML files, you'll immediately appreciate the structure. It's human-readable, and there aren't a gazillion brackets as with JSON. Unlike HTML, XML has no predefined tags.
You can name your tags however you want, but it's best to name them after the business logic they describe. Most XML documents start with the following line - the XML prolog:

<pre><code class="language-xml"><?xml version="1.0" encoding="UTF-8"?></code></pre>

Each XML file must also have exactly one root element, which can have one or many child nodes. Child nodes may have children of their own. Let's see this in action!

The following code snippet declares an XML dataset containing employees. There's one root element - <code><records></code>, and each <code><employee></code> child has children of its own, such as <code><last_name></code>:

<pre><code class="language-xml"><?xml version="1.0" encoding="UTF-8"?>
<records>
    <employee>
        <id>1</id>
        <first_name>John</first_name>
        <last_name>Smith</last_name>
        <position>CEO</position>
        <salary>10000</salary>
        <hire_date>2022-1-1</hire_date>
        <department>Management</department>
    </employee>
    <employee>
        <id>2</id>
        <first_name>Jane</first_name>
        <last_name>Sense</last_name>
        <position>Marketing Associate</position>
        <salary>3500</salary>
        <hire_date>2022-1-15</hire_date>
        <department>Marketing</department>
    </employee>
    <employee>
        <id>3</id>
        <first_name>Frank</first_name>
        <last_name>Brown</last_name>
        <position>R Developer</position>
        <salary>6000</salary>
        <hire_date>2022-1-15</hire_date>
        <department>IT</department>
    </employee>
    <employee>
        <id>4</id>
        <first_name>Judith</first_name>
        <last_name>Rollers</last_name>
        <position>Data Scientist</position>
        <salary>6500</salary>
        <hire_date>2022-3-1</hire_date>
        <department>IT</department>
    </employee>
    <employee>
        <id>5</id>
        <first_name>Karen</first_name>
        <last_name>Switch</last_name>
        <position>Accountant</position>
        <salary>4000</salary>
        <hire_date>2022-1-10</hire_date>
        <department>Accounting</department>
    </employee>
</records></code></pre>

Copy this file and save it locally - we've named it <code>data.xml</code>. You'll need it in the following section, when we work with XML in R.
But before we can do that, you'll have to install two R packages:

<pre><code class="language-r">install.packages("xml2")
install.packages("XML")</code></pre>

Both are used to work with XML, and you can get most things done with the first one alone. The second one has a couple of convenient functions for converting XML files, which we'll cover later.

<blockquote>Want to add a Google Map to Shiny? <a href="https://appsilon.com/interactive-google-maps-with-r-shiny/" target="_blank" rel="noopener">Check out our guide to building interactive Google Maps with R Shiny!</a></blockquote>

First things first, let's see how you can read and parse XML files in R.

<h2 id="basics">R XML Basics - How to Read and Parse XML Files</h2>

By now you should have the dataset saved and the R packages installed. Create a new R script and use the following code to load the packages and read the XML file:

<pre><code class="language-r">library(xml2)
library(XML)

employee_data <- read_xml("data.xml")
employee_data</code></pre>

Here's what it looks like:

<img class="size-full wp-image-14388" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d30e29f75851d3d0390d_079ad41b_1-1.webp" alt="Image 1 - Contents of an XML document loaded into R" width="1924" height="268" />
Image 1 - Contents of an XML document loaded into R

The data is all there, but it's not in a usable form yet. You can make it usable by parsing the entire document or reading individual elements. Let's explore the parsing option first.
Call the <code>xmlParse()</code> function from the <code>XML</code> package and pass in <code>employee_data</code>:

<pre><code class="language-r">employee_xml <- xmlParse(employee_data)
employee_xml</code></pre>

The contents now look like our source file:

<img class="size-full wp-image-14390" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d30e7b3b08462744280c_ad325552_2-1.webp" alt="Image 2 - Parsed XML document" width="676" height="496" />
Image 2 - Parsed XML document

<b>Pro tip:</b> if you don't care about the data, you can print the structure only. That's done with the <code>xml_structure()</code> function from <code>xml2</code>:

<pre><code class="language-r">xml_structure(employee_data)</code></pre>

<img class="size-full wp-image-14392" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d30fbc9849746c73dc5e_47a1a5ce_3.webp" alt="Image 3 - Structure of an XML document" width="348" height="372" />
Image 3 - Structure of an XML document

If you want to access all elements with the same tag, you can use the <code>xml_find_all()</code> function. It takes an XPath expression and returns the matching nodes - the opening tag, the closing tag, and any content between them:

<pre><code class="language-r">xml_find_all(employee_data, ".//position")</code></pre>

<img class="size-full wp-image-14394" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d3102eafb5d246304afc_551298d0_4-1.webp" alt="Image 4 - Accessing individual nodes" width="730" height="228" />
Image 4 - Accessing individual nodes

In case you only want the content, use the <code>xml_text()</code>, <code>xml_integer()</code>, or <code>xml_double()</code> function - depending on the underlying data type.
The first one makes the most sense here:

<pre><code class="language-r">xml_text(xml_find_all(employee_data, ".//position"))</code></pre>

<img class="size-full wp-image-14396" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d311b9100faa4a29c91b_dad3418e_5-1.webp" alt="Image 5 - Getting values from individual nodes" width="1664" height="48" />
Image 5 - Getting values from individual nodes

You now know how to do some basic R XML operations, but most of the time you'll want to convert these files to either a tibble or a data frame for easier access and manipulation. Let's see how to do that next.

<h2 id="dataframes">How to Convert XML Data to tibble and data.frame</h2>

Most of the time with R and XML, you'll want to extract either all or a couple of fields and turn them into a more readable format. We've already shown you how to use <code>xml_text()</code> to extract text from a specific element, and now we'll do a similar thing with integers. Then, we'll combine these two fields into a tibble. Here's the entire code snippet:

<pre><code class="language-r">library(tibble)

# Extract department and salary info
dept <- xml_text(xml_find_all(employee_data, ".//department"))
salary <- xml_integer(xml_find_all(employee_data, ".//salary"))

# Format as a tibble
df_dept_salary <- tibble(department = dept, salary = salary)
df_dept_salary</code></pre>

<img class="size-full wp-image-14398" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d311f870288447678d23_546451ad_6-1.webp" alt="Image 6 - Converting an XML document to an R tibble" width="326" height="312" />
Image 6 - Converting an XML document to an R tibble

Now we have the department names and salaries for all employees.
From here, it's easy to calculate the average salary per department (note that IT is the only department with more than one employee):

<pre><code class="language-r">library(dplyr)

# Group by department name to get average salary by department
df_dept_salary %>%
  group_by(department) %>%
  summarise(salary = mean(salary))</code></pre>

<img class="size-full wp-image-14400" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d312fb3d70e0efa005d2_37dcee61_7.webp" alt="Image 7 - Aggregations on an R tibble" width="322" height="272" />
Image 7 - Aggregations on an R tibble

In case you want to convert the entire XML document to an R data.frame, look no further than the <code>XML</code> package. It has a convenient <code>xmlToDataFrame()</code> function that does the job:

<pre><code class="language-r">df_employees <- xmlToDataFrame(nodes = getNodeSet(employee_xml, "//employee"))
df_employees</code></pre>

<img class="size-full wp-image-14402" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7d313f914760741ca7c41_92bfa9bf_8.webp" alt="Image 8 - Converting an XML document to an R data.frame" width="1170" height="270" />
Image 8 - Converting an XML document to an R data.frame

That's all the loading and preprocessing needed before you can start analyzing and visualizing datasets. It's also the most common pipeline you'll have for loading XML files, so we'll end today's article here.

<hr />

<h2 id="summary">Summary of R XML</h2>

XML files are common in 2022, and as a data professional, you need to know how to work with them. Almost all R XML-related work you'll do boils down to loading and parsing XML documents and converting them to an analysis-friendly format. Today you've learned how to do that with two excellent R packages.

For a homework assignment, try to read only the <code><hire_date></code> element, and make sure to parse it as a date. Is there a built-in function, or do you need to take an extra step?
Make sure to let us know in the comment section below. <blockquote>Excel power user? <a href="https://appsilon.com/r-and-excel/" target="_blank" rel="noopener">You can combine R and Excel with these two packages</a>.</blockquote>
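If you get stuck on the homework, here's one possible approach. Treat it as a minimal sketch, not the definitive answer. As far as we can tell, <code>xml2</code> doesn't ship a date counterpart to <code>xml_integer()</code>, so the sketch reads the values as text and converts them with base R's <code>as.Date()</code>. The <code>xml_string</code>, <code>hire_date_raw</code>, and <code>hire_date</code> names are our own, and the two inlined records are a trimmed copy of <code>data.xml</code> so the snippet runs on its own:

```r
library(xml2)

# Two records trimmed from data.xml, inlined so the sketch is self-contained.
# In the article's setup you would call read_xml("data.xml") instead.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<records>
  <employee><id>1</id><hire_date>2022-1-1</hire_date></employee>
  <employee><id>2</id><hire_date>2022-1-15</hire_date></employee>
</records>'

employee_data <- read_xml(xml_string)

# Step 1: extract the hire dates as plain character strings
hire_date_raw <- xml_text(xml_find_all(employee_data, ".//hire_date"))

# Step 2 (the extra step): convert the strings to Date objects with base R
hire_date <- as.Date(hire_date_raw, format = "%Y-%m-%d")
hire_date
```

The resulting vector has class <code>Date</code>, so it can go straight into a tibble column next to <code>department</code> and <code>salary</code>.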
