Is Your Company Giving You What You Need to Build Great R Shiny Apps?

Reading time:

time

min

March 29, 2023

Is your <strong>R Shiny app slow</strong>, but you doubt Shiny itself is the <strong>bottleneck</strong>? This is a likely scenario, especially if you're given a <strong>slow database</strong> with no <strong>indexes</strong> or <strong>normalization</strong>. Data retrieval and database communication in general become an issue, and it has nothing to do with R Shiny. Today you'll learn how to build great R Shiny apps by solving the most common slow database problem - indexes and normalization. We'll first provision a free Postgres database on AWS, import some data, and then work extensively on this issue. There's even a section on preparing your data in advance, so the resulting Shiny app doesn't have to implement complex SQL statements. Let's get started! <blockquote>Looking to build a production-ready Shiny app? <a href="https://appsilon.com/rhino-r-package-tutorial/" target="_blank" rel="noopener">Try Rhino - An R package from Appsilon</a>.</blockquote> Table of contents: <ul><li><a href="#database-setup">Database Setup, Connection, and Data Import</a></li><li><a href="#indexes">Database Indexes - Build Great R Shiny Apps By Speeding up Your Queries up to 5000 Times</a></li><li><a href="#normalization">Database Normalization - From Theory to Implementation</a></li><li><a href="#views">Preparing Data in Advance - A Must If You Want to Build Great R Shiny Apps</a></li><li><a href="#summary">Summing Up How to Build Great R Shiny Apps</a></li></ul> <hr /> <h2 id="database-setup">Database Setup, Connection, and Data Import</h2> Setting up databases is now easier than ever, and most cloud providers will allow you to use one database instance free of charge. <a href="http://aws.amazon.com/" target="_blank" rel="noopener">AWS</a> is one such cloud provider, and we'll use it to run a free-tier PostgreSQL database. <blockquote>Concerned with your data quality? <a href="https://appsilon.com/automated-r-data-quality-reporting/" target="_blank" rel="noopener">Automate data quality reports with the R package: data.validator</a>.</blockquote> It's not an entirely straightforward process, but we'll walk you through each step. The only prerequisite is that you have an AWS account registered. <h3>AWS Security Group for Database Access</h3> Before even provisioning a database, you have to create a new <b>Security Group</b>. Put simply, this will allow you to <i>tag</i> the database, so specific inbound/outbound rules apply to it. You want to allow all traffic from anywhere, just as shown in the image below: <img class="size-full wp-image-18276" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af4083a87261755609e8_07ac5663_1.webp" alt="Image 1 - AWS security group that allows all traffic from anywhere" width="3110" height="2298" /> Image 1 - AWS security group that allows all traffic from anywhere As you can see, we named our security group <code>r-postgres</code>, and this name is a vital piece of information to remember for the following step. You could be more restrictive if this is a production application, but this simple rule will serve us for now. Up next, you can start provisioning the database. <h3>Provisioning a Free-Tier PostgreSQL Database</h3> Under <b>RDS</b>, provision a new database and make sure to select <b>PostgreSQL</b> as a database vendor, and <b>Free tier</b> under templates: <img class="size-full wp-image-18278" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af41db9f11d130ce2fdd_43ca5144_2.webp" alt="Image 2 - Provisioning a new PostgreSQL database (1)" width="3110" height="2298" /> Image 2 - Provisioning a new PostgreSQL database (1) If you scroll down a bit, you'll see two essential things to configure. The first one is <b>Public access</b>. You must select "Yes" as an option, otherwise, the database won't get a public IP you can use for the connection. Also, make sure to select <code>r-postgres</code> as a <b>VPC security group</b>: <img class="size-full wp-image-18280" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af41b54b1c0441e988e1_2f5993cf_3.webp" alt="Image 3 - Provisioning a new PostgreSQL database (2)" width="3110" height="2298" /> Image 3 - Provisioning a new PostgreSQL database (2) You'll also have to specify a username/password combination for the database instance, so make sure to go with something you'll remember. That's everything you need to change, so scroll to the bottom and click on the big orange button to create the database instance. You'll see a screen like this one after a couple of minutes: <img class="size-full wp-image-18282" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af42e02dfe29f7317171_f26a3ba3_4.webp" alt="Image 4 - Provisioning a new PostgreSQL database (3)" width="3110" height="2298" /> Image 4 - Provisioning a new PostgreSQL database (3) Note the <b>Endpoint</b> and <b>Port</b> values - these are essential for establishing connections either from R or from GUI tools. Let's see how next. <h3>Database Connection from a GUI Tool</h3> We're using a free version of <a href="http://tableplus.com/" target="_blank" rel="noopener">TablePlus</a> as a GUI tool for database communication and SQL execution. You are free to use any other tool, of course. In TablePlus, create a new connection to PostgreSQL and paste your specific Host/Port/User/Password values - you know these from the previous section: <img class="size-full wp-image-18284" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af43bb4e9308f9a4b610_fe2f54f7_5.webp" alt="Image 5 - TablePlus PostgreSQL connection" width="1224" height="1008" /> Image 5 - TablePlus PostgreSQL connection The fields will light up green when you click on "Test" if the connection succeeds. If that's the case, click on "Connect" and you're ready to go. If the fields light up red, you entered some of the values incorrectly, or the database instance hasn't finished provisioning yet. Make sure to double-check. Up next, let's see how to establish a database connection from R. <h3>Database Connection from R</h3> R needs an additional <code>DBI</code> package in order to establish a connection with various databases, PostgreSQL being one of them. Simply install it by running <code>install.packages("DBI")</code> and you'll be good to go. Regarding the connection itself, refer to the snippet below. It establishes a connection to a remote AWS RDS Database, runs a simple query, prints the result, and disconnects from the database. Of course, don't forget to fill out the connection details first: <pre><code class="language-r">library(DBI) <br>conn <- dbConnect( RPostgres::Postgres(), dbname = "postgres", host = "", port = "", user = "", password = "" ) <br>q <- dbGetQuery( conn = conn, statement = "SELECT 1 AS myid" ) <br>print(q) <br>dbDisconnect(conn = conn)</code></pre> If everything went well, you'll see the following printed to the R console: <img class="size-full wp-image-18286" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af432814a08d133da869_bb7ae465_6.webp" alt="Image 6 - PostgreSQL connection through R" width="232" height="126" /> Image 6 - PostgreSQL connection through R We now have everything the connection established both through R and a GUI tool, so there's only one more thing needed before exploring how to build great R Shiny apps - the data itself. Let's take care of it next. <h3>Data Creation and Import to PostgreSQL Database</h3> Most real-world applications summarize huge volumes of data and display summary statistics through text, tables, and charts. We're somewhat limited to the volume of data in a free-tier database, but we can still create and transfer reasonably large datasets. We'll mix things up technology-wise by using Python to create the data. If you don't have Python installed, or don't want to use it to create the dataset, simply <a href="https://wordpress.appsilon.com/wp-content/uploads/2023/03/employee_data.csv" target="_blank" rel="noopener">download it from our servers</a>. The following Python script will create a 1M row dataset containing dummy, random employee data, parse it with <code>pandas</code> and save it locally in CSV file format: <pre><code class="language-python">import numpy as np import pandas as pd import names <br> department_options = ["Sales", "Marketing", "IT", "Management", "Accounting"] <br>def get_salary(department: str) -> float: mappings = { "Sales": 1.0, "Marketing": 1.25, "IT": 2, "Management": 3, "Accounting": 1.5 } base_sal = np.random.randint(low=3000, high=5000) try: return base_sal * mappings[department] except Exception as e: return base_sal if __name__ == "__main__": _ids = np.arange(1, 1000001) _first_names = [names.get_first_name() for name in range(len(_ids))] _last_names = [names.get_last_name() for name in range(len(_ids))] _departments = [np.random.choice(department_options) for dept in range(len(_ids))] _salaries = [get_salary(dept) for dept in _departments] _years_experience = [int(sal // 1000) for sal in _salaries] df = pd.DataFrame({ "id": _ids, "first_name": _first_names, "last_name": _last_names, "department": _departments, "salary": _salaries, "years_experience": _years_experience }) df.to_csv("/path/to/Desktop/employee_data.csv", index=False)</code></pre> Here's what the first couple of rows looks like: <img class="size-full wp-image-18288" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af4483a8726175560ace_3a03ca8e_7.webp" alt="Image 7 - Head of the employees dataset" width="1328" height="492" /> Image 7 - Head of the employees dataset The next step is to create a database table and import the dataset. Run the following SQL code for table creation: <pre><code class="language-sql">CREATE TABLE employees( emp_id INTEGER NOT NULL, emp_first_name VARCHAR(64), emp_last_name VARCHAR(64), emp_department VARCHAR(32), emp_salary REAL, emp_years_experience INTEGER );</code></pre> And now right-click on the table from the left menu, and click on <b>Import</b>. Provide a path to the CSV file, and map the table columns to the columns found in the CSV file: <img class="size-full wp-image-18290" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af45537ef68dd1d0e86d_329cc553_8.webp" alt="Image 8 - CSV to database table column mappings" width="2494" height="1654" /> Image 8 - CSV to database table column mappings Once done, click on "Import" - it will take a couple of minutes: <img class="size-full wp-image-18292" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af4674fe971f0385d7aa_447771e5_9.webp" alt="Image 9 - Importing local data to a PostgreSQL database" width="2494" height="1654" /> Image 9 - Importing local data to a PostgreSQL database And that's it - the dataset is now loaded into the table and we can start optimizing it for better R Shiny performance. <h2 id="indexes">Database Indexes - Build Great R Shiny Apps By Speeding up Your Queries up to 5000 Times</h2> You can think of SQL table indexes like index page(s) in a book. Essential topics and ideas are listed at the end or beginning, so you can find the topics you need in no time. In other words, you can <i>scan</i> the index to find topics of interest and their respective location. The same methodology applies to database table indexes. If the table is indexed properly, it will alter the execution plan to retrieve the data as quickly as possible. This is one of the best and most guaranteed ways to build great R Shiny apps and improve Shiny app performance overall. But what about the negative sides? SQL indexes aren't without flaws, and the biggest considerations come down to additional storage space in your database, and slower insertion/update operations, as the index has to be updated every time. Keep these in mind. There are many types of SQL table indexes, and we'll explore four of them in this section. But first, let's <i>duplicate</i> our <code>employees</code> table so we can easily compare performances between queries on indexed and non-indexed tables: <pre><code class="language-sql">CREATE TABLE employees_indexed AS SELECT * FROM employees;</code></pre> Once done, proceed to the unique index section. <h3>Unique Index</h3> This type of index provides an index for each unique value. It's typically associated with primary keys. Here's how you can add it to our ID column: <pre><code class="language-sql">CREATE UNIQUE INDEX ON employees_indexed(emp_id);</code></pre> PostgreSQL allows us to compare performance and execution plans by putting the <code>EXPLAIN ANALYZE</code> keywords in front of a <code>SELECT</code> statement. We'll use it to compare performances when fetching an employee with a specific <code>emp_id</code> value. First, for the non-indexed table: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees WHERE emp_id = 578652;</code></pre> <img class="size-full wp-image-18294" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af46849aa4fb70c955cf_84d2d6bd_10.webp" alt="Image 10 - Unique index (1)" width="2612" height="1714" /> Image 10 - Unique index (1) In absolute terms, 181 ms is next to nothing, but gathering data from dozens of SQL queries quickly adds up. Let's run the same on the indexed table and compare the relative scores: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees_indexed WHERE emp_id = 578652;</code></pre> <img class="size-full wp-image-18296" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af47c0dc47f4ac1e7e6c_9743bbef_11.webp" alt="Image 11 - Unique index (2)" width="2612" height="1714" /> Image 11 - Unique index (2) Put simply, the indexed version is thousands of times faster, which is an amazing performance optimization considering it only took us one line of SQL code to create the index. This is because the non-indexed implementation requires a full-table scan to find the record, while the indexed implementation uses a much faster index scan. Let's go over non-unique indexes next. <h3>Non-Unique Index</h3> This type of index is ideal when the data in a given column is not unique but often used in the <code>WHERE</code> clause. It's a default type of index created by PostgreSQL, so to add it, just drop the <code>UNIQUE</code> keyword. Let's create one on the last name column: <pre><code class="language-sql">CREATE INDEX index_employees_last_name ON employees_indexed(emp_last_name);</code></pre> Let's now fetch some records and put the <code>emp_last_name</code> as a filter condition in the <code>WHERE</code> clause: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT emp_last_name, emp_department, emp_salary FROM employees WHERE emp_last_name = 'Anderson';</code></pre> <img class="size-full wp-image-18298" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af47c0dc47f4ac1e7e85_c588f10e_12.webp" alt="Image 12 - Non-unique index (1)" width="2612" height="1714" /> Image 12 - Non-unique index (1) The indexed implementation groups data into heaps behind the scenes, so the execution plan will change and the overall results will be returned much faster: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT emp_last_name, emp_department, emp_salary FROM employees_indexed WHERE emp_last_name = 'Anderson';</code></pre> <img class="size-full wp-image-18300" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af485d4d39bf0cf1ba0a_1961a96d_13.webp" alt="Image 13 - Non-unique index (2)" width="2612" height="1714" /> Image 13 - Non-unique index (2) It's not a couple of thousands of times faster, but the difference is still quite significant. <h3>Multi-Column Index</h3> No one is stopping you from adding a single index to multiple table columns. This is a de-facto practice when multiple columns are oftentimes used to filter out records in a <code>WHERE</code> clause. If you're using multiple filter conditions, consider adding a multi-column index to these columns. Let's create by combining <code>emp_last_name</code> and <code>emp_department</code>: <pre><code class="language-sql">CREATE INDEX index_employees_last_name_dept ON employees_indexed(emp_last_name, emp_department);</code></pre> And now let's fetch some records by filtering down on these two conditions - last name and department: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees WHERE emp_last_name = 'Anderson' AND emp_department = 'Management';</code></pre> <img class="size-full wp-image-18302" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af48c683397a52de3b70_404921e4_14.webp" alt="Image 14 - Multi-column index (1)" width="2612" height="1714" /> Image 14 - Multi-column index (1) Not too bad, 232 ms, but indexed implementation takes this to a whole new level: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees_indexed WHERE emp_last_name = 'Anderson' AND emp_department = 'Management';</code></pre> <img class="size-full wp-image-18304" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b7af496c7eb174b189b1c7_8768613c_15.webp" alt="Image 15 - Multi-column index (2)" width="2612" height="1714" /> Image 15 - Multi-column index (2) Under a single millisecond, that's impressive! If you think these results are great, wait until you see what a sorted index can do. <h3>Sorted Index</h3> By default, an index is sorted in ascending order. But what if you're typically looking for the last records, like employee names that start with a "Z"? Well, adding a sorted index with descending sorting is the way to go. Let's add one on the <code>emp_first_name</code> column: <pre><code class="language-sql">CREATE INDEX index_employees_first_name_desc ON employees_indexed(emp_first_name DESC NULLS LAST);</code></pre> Now we'll run a query that keeps only the employees named "Zack": <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees WHERE emp_first_name = 'Zack';</code></pre> <img class="size-full wp-image-18306" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f69d483ce82c164e316_16.webp" alt="Image 16 - Sorted index (1)" width="2612" height="1714" /> Image 16 - Sorted index (1) Not too bad considering we have 1M records in the table. But the indexed implementation makes this 175 ms look like a joke: <pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM employees_indexed WHERE emp_first_name = 'Zack';</code></pre> <img class="size-full wp-image-18308" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29f6ca993e4be789c35b7_17.webp" alt="Image 17 - Sorted index (2)" width="2612" height="1714" /> Image 17 - Sorted index (2) These are the four essential types of SQL indexes you must know. Next, let's see how they help you build great R Shiny apps. <h3>Leveraging Indexed Data in R Shiny</h3> It goes without saying, but indexes are made on SQL tables and have nothing to do with R and Shiny. This is the whole idea behind fixing the data problems at the source, and not blaming the dashboard framework. The dashboard you're about to see allows the user to select minimum and maximum employee ID, and update the table data with a press of the action button. We're using the <code>DT</code> package for table data visualization, but there are other options: <blockquote>Need to render a table in R Shiny? <a href="https://appsilon.com/top-r-packages-for-table-data/" target="_blank" rel="noopener">Consider these top R Package for visualizing Table Data</a>.</blockquote> The Shiny app establishes a database connection on the app run and kills it when the app is closed. This way, you don't have to worry about leaving the connection open by accident. Anyhow, here's the code: <pre><code class="language-r">library(DBI) library(DT) library(shiny) <br>conn <- dbConnect( RPostgres::Postgres(), dbname = "postgres", host = "", port = "", user = "", password = "" ) <br> ui <- fluidPage( sidebarLayout( sidebarPanel( tags$h3("Shiny PostgreSQL"), tags$br(), numericInput(inputId = "minId", label = "Min ID:", value = 100000, min = 1, max = 1000000, step = 1), numericInput(inputId = "maxId", label = "Max ID:", value = 110000, min = 1, max = 1000000, step = 1), actionButton(inputId = "btn", label = "Fetch") ), mainPanel( textOutput(outputId = "outDuration"), DTOutput(outputId = "outTable") ) ) ) <br> server <- function(input, output, session) { v <- reactiveValues( minId = NULL, maxId = NULL, tableName = NULL, timeStart = NULL, timeEnd = NULL ) observeEvent(input$btn, { v$minId <- input$minId v$maxId <- input$maxId }) data <- reactive({ v$timeStart <- Sys.time() if (is.null(v$minId) || is.null(v$maxId)) return() q <- dbGetQuery( conn = conn, statement = paste( "SELECT * FROM employees_indexed WHERE emp_id BETWEEN", v$minId, "AND", v$maxId ) ) v$timeEnd <- Sys.time() data.frame(q) }) output$outDuration <- renderText({ if (is.null(v$timeStart) || is.null(v$timeEnd)) return() duration <- v$timeEnd - v$timeStart paste("Duration:", duration, "seconds.") }) output$outTable <- renderDT({ datatable( data = data(), filter = "top" ) }) <br> session$onSessionEnded(function() { dbDisconnect(conn = conn) print("Database connection closed.") }) } <br> shinyApp(ui = ui, server = server)</code></pre> Let's run the app to see what we're dealing with: <img class="size-full wp-image-18310" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01b85e196d46b4e1b9c87_18.gif" alt="Image 18 - R Shiny application with indexed SQL table data" width="1330" height="802" /> Image 18 - R Shiny application with indexed SQL table data Overall, the app is fast, which perfectly demonstrates that R Shiny is a fast framework for building apps and dashboards. What might slow it down is unoptimized databases and data sources, so make sure to keep yours in order. <h2 id="normalization">Database Normalization - From Theory to Implementation</h2> Database normalization is a technique used when designing databases to reduce data redundancy, divide large (flat) tables into smaller ones, and link them using relationships. Normalization is often described with <b>Normal Forms (NF)</b>. They range from 1 to 6, but in practice, normalization is typically achieved at the 3rd Normal Form (NF3). The third normal form assumes the data prior to normalization is already in the second normal form. You have to go sequentially or go through normal forms in order. Here's what it means in a practical sense: <ul><li><b>First Normal Form (1NF)</b><ul><li>Each column must contain only a single value</li><li>Each record must be unique</li></ul> </li> <li><b>Second Normal Form (2NF)</b> <ul><li>Must be in 1NF</li><li>The table must have a single-column primary key that does not depend on any subset of candidate key relations</li></ul> </li> <li><b>Third Normal Form (3NF)</b> <ul><li>Must be in 2NF</li><li>The table must not have transitive functional dependencies</li></ul> </li> </ul> In short, one table must represent one entity (sale, customer, employee), and the tables have to be linked with foreign keys. This isn't an in-depth course in database normalization, so we'll stop here. If you want to learn more, this <a href="https://www.guru99.com/database-normalization.html" target="_blank" rel="noopener">article from guru99.com</a> might help. We'll now focus on creating and storing a denormalized data representation to see in practical terms what's wrong with it. <h3>Create and Load Denormalized Sales Data</h3> In this section, we'll once again use Python to create a denormalized dataset. It will contain details on an employee, their salary, office branch, sale made, and sale details. You can download the dataset from the <a href="https://wordpress.appsilon.com/wp-content/uploads/2023/03/sales_data_denormalized.csv" target="_blank" rel="noopener">Appsilon server</a> if you don't want to create it manually. Here's the dataset creation code: <pre><code class="language-python">import random import numpy as np import pandas as pd from datetime import datetime, timedelta <br> def get_branch_details(branch_name: str) -> dict: mappings = { "New York": {"branch_manager": "John Johnson", "n_employees": 32}, "San Francisco": {"branch_manager": "Jane Janeson", "n_employees": 26} } return mappings[branch_name] <br> def generate_sale() -> dict: customers = ["Company A", "Company B", "Company C"] quantities = np.arange(1, 101) unit_prices = [5, 10, 15, 20, 25] def get_random_date(min_date: datetime = datetime.now() - timedelta(days=365), max_date: datetime = datetime.now()) -> datetime: delta = max_date - min_date int_delta = (delta.days * 24 * 60 * 60) + delta.seconds random_second = random.randrange(int_delta) return min_date + timedelta(seconds=random_second) c_customer = random.choice(customers) c_quantity = random.choice(quantities) c_unit_price = random.choice(unit_prices) c_total_price = c_quantity * c_unit_price c_sale_date = get_random_date() return { "sale_customer": c_customer, "sale_quantity": c_quantity, "sale_unit_price": c_unit_price, "sale_total_price": c_total_price, "sale_date": c_sale_date } <br> if __name__ == "__main__": employees = [ {"employee_id": 1, "first_name": "Bob", "last_name": "Scott", "hire_date": "2020-01-05", "email": "bob@company.com", "salary": 4000, "office_branch": "New York"}, {"employee_id": 2, "first_name": "Mark", "last_name": "Statson", "hire_date": "2018-09-01", "email": "mark@company.com", "salary": 6000, "office_branch": "San Francisco"}, {"employee_id": 3, "first_name": "Marry", "last_name": "June", "hire_date": "2015-01-15", "email": "marry@company.com", "salary": 7000, "office_branch": "New York"}, {"employee_id": 4, "first_name": "Dakota", "last_name": "Press", "hire_date": "2022-06-01", "email": "dakota@company.com", "salary": 2500, "office_branch": "New York"}, {"employee_id": 5, "first_name": "Dylan", "last_name": "Star", "hire_date": "2022-11-15", "email": "dylan@company.com", "salary": 2000, "office_branch": "San Francisco"} ] records_per_employee = {1: 100, 2: 225, 3: 550, 4: 80, 5: 65} sales_data = [] for i, emp in enumerate(employees): n_records = records_per_employee[emp["employee_id"]] for record in range(n_records): sales_data.append({ **emp, **get_branch_details(branch_name=emp["office_branch"]), **generate_sale() }) sales_data_df = pd.DataFrame(sales_data) sales_data_df.to_csv("/path/to/Desktop/sales_data_denormalized.csv", index=False)</code></pre> Here's what a sample of 15 records looks like: <img class="size-full wp-image-18312" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b29fa0492b6bd1d0b69901_19.webp" alt="Image 19 - Sample of our denormalized dataset" width="3530" height="1062" /> Image 19 - Sample of our denormalized dataset This dataset is considered to be denormalized. <b>Why?</b> Well, it's one table that contains information on multiple entities - employee, customer, office branch, and sale. You can load the data into the database by importing the CSV file, just like before, so we won't go over the process again. Let's work on its normalization next. <h3>Database Normalization in Action</h3> So far, this is what our denormalized table looks like: <img class="size-full wp-image-18314" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2a07d443b637e3012c14b_20.webp" alt="Image 20 - Denormalized table structure" width="2204" height="1486" /> Image 20 - Denormalized table structure We want to normalize it by creating a table for each entity and assigning an auto-incremented ID column that will serve as a foreign key to other tables. In other words, this is the normalized structure we're aiming for: <img class="size-full wp-image-18316" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2a07dc684138f31c8be9c_21.webp" alt="Image 21 - Normalized table structure" width="2228" height="1626" /> Image 21 - Normalized table structure To achieve this, let's first create the tables and connect them via the <code>FOREIGN KEY</code> constraints: <pre><code class="language-sql">CREATE TABLE customers( cst_id SERIAL PRIMARY KEY, cst_name VARCHAR(64) ); <br>CREATE TABLE office_branches( obr_id SERIAL PRIMARY KEY, obr_name VARCHAR(64), obr_branch_manager VARCHAR(64), obr_n_employees INTEGER ); <br>CREATE TABLE employees( emp_id SERIAL PRIMARY KEY, emp_first_name VARCHAR(32), emp_last_name VARCHAR(32), emp_hire_date DATE, emp_email VARCHAR(64), emp_salary REAL, emp_obr_id INTEGER, CONSTRAINT fk_emp_office_branches FOREIGN KEY(emp_obr_id) REFERENCES office_branches(obr_id) ); <br>CREATE TABLE sales( sal_id SERIAL PRIMARY KEY, sal_emp_id INTEGER, sal_cst_id INTEGER, sal_quantity INTEGER, sal_unit_price REAL, sal_total_price REAL, sal_date DATE, CONSTRAINT fk_sal_employees FOREIGN KEY(sal_emp_id) REFERENCES employees(emp_id), CONSTRAINT fk_sal_customers FOREIGN KEY(sal_cst_id) REFERENCES customers(cst_id) );</code></pre> The tables are now created but are empty at the moment. The goal is for each table to <b>contain only the unique elements</b>, referenced from the main <code>sales</code> table via foreign keys. We have to start from the simplest lookup tables first because tables with foreign keys will expect a connection to these tables upon insertion: <pre><code class="language-sql">INSERT INTO customers(cst_name) SELECT DISTINCT(sale_customer) FROM sales_data_denormalized ORDER BY 1; <br> INSERT INTO office_branches(obr_name, obr_branch_manager, obr_n_employees) SELECT DISTINCT office_branch, branch_manager, n_employees FROM sales_data_denormalized ORDER BY 1; <br> INSERT INTO employees(emp_first_name, emp_last_name, emp_hire_date, emp_email, emp_salary, emp_obr_id) SELECT DISTINCT first_name, last_name, hire_date, email, salary, (SELECT obr_id FROM office_branches WHERE obr_name = office_branch) AS branch_id FROM sales_data_denormalized; <br> INSERT INTO sales(sal_emp_id, sal_cst_id, sal_quantity, sal_unit_price, sal_total_price, sal_date) SELECT (SELECT emp_id FROM employees WHERE emp_email = email) AS employee_id, (SELECT cst_id FROM customers WHERE cst_name = sale_customer) AS customer_id, sale_quantity, sale_unit_price, sale_total_price, sale_date FROM sales_data_denormalized ORDER BY 1;</code></pre> This wasn't so difficult, was it? We'll now issue a couple of queries to verify everything is working correctly. <h3>Querying Normalized Database Tables from a GUI</h3> Running SQL queries on normalized tables isn't as convenient for a developer as fetching rows from a single table. You now have to join multiple tables with foreign keys, and the entire SQL statement will be bulkier for that reason. Nevertheless, this is a way to go if you want to respect industry standards, and conventions, and want to optimize database performance. The first query we'll run will return the employee's full name, office branch, and the total amount of sales (USD) after January 1st, 2023. We're only outputting employees in the New York branch for easier integration with Shiny later on: <pre><code class="language-sql">SELECT e.emp_first_name || ' ' || e.emp_last_name AS employee_name, o.obr_name AS office_branch, SUM(s.sal_total_price) AS amount_sold FROM sales s JOIN employees e ON s.sal_emp_id = e.emp_id JOIN office_branches o ON e.emp_obr_id = o.obr_id WHERE s.sal_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD') AND o.obr_name = 'New York' GROUP BY e.emp_first_name || ' ' || e.emp_last_name, o.obr_name ORDER BY 3 DESC;</code></pre> Here's the output: <img class="size-full wp-image-18318" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2a07ec32658f042fa8e71_22.webp" alt="Image 22 - Total sales by employee" width="2862" height="1846" /> Image 22 - Total sales by employee Everything looks right, so let's proceed with the second SQL query. It will provide a summary of sales by month in the New York branch: <pre><code class="language-sql">SELECT DATE_TRUNC('MONTH', s.sal_date)::DATE AS period, SUM(s.sal_total_price) AS total_sold FROM sales s JOIN employees e ON s.sal_emp_id = e.emp_id JOIN office_branches o ON e.emp_obr_id = o.obr_id WHERE o.obr_name = 'New York' GROUP BY DATE_TRUNC('MONTH', s.sal_date)::DATE ORDER BY 1;</code></pre> <img class="size-full wp-image-18320" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2a07fafe4df21c48b2f63_23.webp" alt="Image 23 - Sales total over time for the New York branch" width="2862" height="1846" /> Image 23 - Sales total overtime for the New York branch This will make for a great-looking line chart in the following section. <h3>Querying Normalized Database Tables from R Shiny</h3> Now it's time to see the results of database normalization in R Shiny. Normalization is mandatory if you want to build great R Shiny apps, but it doesn't necessarily have to come from your end. If you're an <a href="https://appsilon.com/r-for-programmers/" target="_blank" rel="noopener">R developer</a>, you can just ask your database manager or database engineer whether the database tables are properly normalized. Back to the topic now. The dashboard code you'll see below allows the user to filter the record based on the office branch (either New York or San Francisco). As soon as the update is made, two queries from the previous section are run. The results from the first query are represented as a data table, and the second as a line chart. <blockquote>Wait, what is a line chart? <a href="https://appsilon.com/ggplot2-line-charts/" target="_blank" rel="noopener">Read our complete guide to making and styling line charts with R and ggplot2</a>.</blockquote> The server logic mostly boils down to writing SQL statements, which is something we'll optimize in the following section. For now, bear with us and run the following code: <pre><code class="language-r">library(DBI) library(DT) library(ggplot2) library(shiny) <br># Global database connection conn <- dbConnect( RPostgres::Postgres(), dbname = "postgres", host = "", port = "", user = "", password = "" ) <br> ui <- fluidPage( sidebarLayout( sidebarPanel( tags$h3("Shiny PostgreSQL"), tags$br(), selectInput(inputId = "selBranch", label = "Branch:", choices = c("New York", "San Francisco"), selected = "New York") ), mainPanel( # Table for the first query tags$div( tags$h4("2023 Statistics for a Selected Branch"), DTOutput(outputId = "table2023Stats"), ), # Chart for the second query tags$div( tags$h4("Sales by Month for a Selected Branch"), plotOutput(outputId = "plotTimeSeries") ) ) ) ) <br> server <- function(input, output, session) { # Get table data data_stats <- reactive({ q <- dbGetQuery( conn = conn, statement = paste0( "SELECT ", "e.emp_first_name || ' ' || e.emp_last_name AS employee_name, ", "o.obr_name AS office_branch, ", "SUM(s.sal_total_price) AS amount_sold ", "FROM sales s ", "JOIN employees e ON s.sal_emp_id = e.emp_id ", "JOIN office_branches o ON e.emp_obr_id = o.obr_id ", "WHERE s.sal_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD') ", "AND o.obr_name = '", input$selBranch, "' GROUP BY e.emp_first_name || ' ' || e.emp_last_name, o.obr_name ", "ORDER BY 3 DESC" ) ) data.frame(q) }) # Get chart data data_graph <- reactive({ q <- dbGetQuery( conn = conn, statement = paste0( "SELECT ", "DATE_TRUNC('MONTH', s.sal_date)::DATE AS period, ", "SUM(s.sal_total_price) AS total_sold ", "FROM sales s ", "JOIN employees e ON s.sal_emp_id = e.emp_id ", "JOIN office_branches o ON e.emp_obr_id = o.obr_id ", "WHERE o.obr_name = '", input$selBranch, "' GROUP BY DATE_TRUNC('MONTH', s.sal_date)::DATE ", "ORDER BY 1" ) ) data.frame(q) }) # Display table data output$table2023Stats <- renderDT({ datatable( data = data_stats(), filter = "top" ) }) # Display chart data output$plotTimeSeries <- renderPlot({ ggplot(data_graph(), aes(x = period, y = total_sold)) + geom_line(color = "#0099f9", size = 2) + geom_point(color = "#0099f9", size = 5) + geom_label( aes(label = total_sold), nudge_x = 0.25, nudge_y = 0.25, check_overlap = TRUE ) + theme_classic() }) <br> # Close connection on app exit session$onSessionEnded(function() { dbDisconnect(conn = conn) print("Database connection closed.") }) } <br> shinyApp(ui = ui, server = server)</code></pre> Here's the R Shiny app you should see: <img class="size-full wp-image-18322" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01b8cc32d6f6c44c21443_24.gif" alt="Image 24 - R Shiny dashboard leveraging normalized database tables" width="1506" height="946" /> Image 24 - R Shiny dashboard leveraging normalized database tables That was easy, right? Up next, you'll learn how to get rid of the sheer code volume produced by the SQL statements. <h2 id="views">Preparing Data in Advance - A Must If You Want to Build Great R Shiny Apps</h2> SQL Views will allow us to drastically reduce the amount of SQL code we have to write in R/R Shiny. They're also a great way for data engineers to prepare data for you, especially if SQL isn't where you shine. But first, <b>what are Views?</b> You can think of views as virtual tables that aggregate data from a table or set of tables and update automatically as the data in the mentioned tables changes. For example, you can wrap the entire logic of the previous two SQL statements in a view, and then run a <code>SELECT</code> command on that view. Let's see how it works in practice. <h3>Creating Views in PostgreSQL</h3> Our first query summarizes sale statistics for all employees in a branch from January 1st, 2023 onwards. We'll remove the branch filter condition from the SQL code since we want the view to contain data for all branches, and then we'll filter them out in R Shiny. You can create a view (or replace an existing one) in PostgreSQL by running a <code>CREATE OR REPLACE VIEW <view-name></code> command, and then pasting a <code>SELECT</code> statement afterward. For example, here's the code required to create a view named <code>v_emp_stats_2023</code>: <pre><code class="language-sql">CREATE OR REPLACE VIEW v_emp_stats_2023 AS ( SELECT e.emp_first_name || ' ' || e.emp_last_name AS employee_name, o.obr_name AS office_branch, SUM(s.sal_total_price) AS amount_sold FROM sales s JOIN employees e ON s.sal_emp_id = e.emp_id JOIN office_branches o ON e.emp_obr_id = o.obr_id WHERE s.sal_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD') GROUP BY e.emp_first_name || ' ' || e.emp_last_name, o.obr_name ORDER BY 3 DESC ); <br>SELECT * FROM v_emp_stats_2023;</code></pre> And here's what it contains: <img class="size-full wp-image-18324" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b2a0803230f605361ba769_25.webp" alt="Image 25 - Contents of a SQL View (1)" width="2862" height="1846" /> Image 25 - Contents of a SQL View (1) Try creating the second view - <code>v_total_sales_by_month</code> without looking at the code snippet below. It will be a nice and simple challenge to make sure you've grasped the concept. Can't figure it out or just want to verify? Here's our code: <pre><code class="language-sql">CREATE OR REPLACE VIEW v_total_sales_by_month AS ( SELECT DATE_TRUNC('MONTH', s.sal_date)::DATE AS period, o.obr_name AS office_branch, SUM(s.sal_total_price) AS total_sold FROM sales s JOIN employees e ON s.sal_emp_id = e.emp_id JOIN office_branches o ON e.emp_obr_id = o.obr_id GROUP BY DATE_TRUNC('MONTH', s.sal_date)::DATE, o.obr_name ORDER BY 1, 2 ); <br>SELECT * FROM v_total_sales_by_month;</code></pre> <img class="size-full wp-image-18326" src="https://webflow-prod-assets.s3.amazonaws.com/6525256482c9e9a06c7a9d3c%2F65b01b8f314317dbcb9545fb_26.webp" alt="Image 26 - Contents of a SQL View (2)" width="2862" height="1846" /> Image 26 - Contents of a SQL View (2) We now have the views created, so let's make appropriate changes to the R Shiny application. <h3>Changes in R Shiny App Structure</h3> You'll only have to change the bits of the <code>server</code> logic. You'll want to select everything from the view and add a filter condition, so only the records from a currently selected branch are displayed. Feel free to copy the code, or try to implement the changes on your own - it'll be a good SQL practice: <pre><code class="language-r">library(DBI) library(DT) library(ggplot2) library(shiny) <br># Global database connection conn <- dbConnect( RPostgres::Postgres(), dbname = "postgres", host = "", port = "", user = "", password = "" ) <br> ui <- fluidPage( sidebarLayout( sidebarPanel( tags$h3("Shiny PostgreSQL"), tags$br(), selectInput(inputId = "selBranch", label = "Branch:", choices = c("New York", "San Francisco"), selected = "New York") ), mainPanel( # Table for the first query tags$div( tags$h4("2023 Statistics for a Selected Branch"), DTOutput(outputId = "table2023Stats"), ), # Chart for the second query tags$div( tags$h4("Sales by Month for a Selected Branch"), plotOutput(outputId = "plotTimeSeries") ) ) ) ) <br> server <- function(input, output, session) { # Get table data data_stats <- reactive({ q <- dbGetQuery( conn = conn, statement = paste0( "SELECT * FROM v_emp_stats_2023 WHERE office_branch = '", input$selBranch, "'" ) ) data.frame(q) }) # Get chart data data_graph <- reactive({ q <- dbGetQuery( conn = conn, statement = paste0( "SELECT period, total_sold FROM v_total_sales_by_month WHERE office_branch = '", input$selBranch, "'" ) ) data.frame(q) }) # Display table data output$table2023Stats <- renderDT({ datatable( data = data_stats(), filter = "top" ) }) # Display chart data output$plotTimeSeries <- renderPlot({ ggplot(data_graph(), aes(x = period, y = total_sold)) + geom_line(color = "#0099f9", size = 2) + geom_point(color = "#0099f9", size = 5) + geom_label( aes(label = total_sold), nudge_x = 0.25, nudge_y = 0.25, check_overlap = TRUE ) + theme_classic() }) # Close connection on app exit session$onSessionEnded(function() { dbDisconnect(conn = conn) print("Database connection closed.") }) } <br> shinyApp(ui = ui, server = server)</code></pre> The dashboard looks identical to what we had previously, so there's no need to show it again. The code blocks are much shorter and easier to read now, though. That's all for the tips we have for you today. Let's make a short recap next. <hr /> <h2 id="summary">Summing Up How to Build Great R Shiny Apps</h2> To summarize, applications and dashboards can be slow. If that's the case, your users won't like it, and you're likely to end up losing a bunch of them. That being said, oftentimes it has nothing to do with the programming language, tool, or technology you write the dashboard in, but is instead related to poor database performance and management. We hope our article has helped you shed some light on what could possibly go wrong on the database end. It's easy to dump everything you have in a single flat table and call it a day, but that approach is likely to have a huge negative impact on your app performance. Spend some time learning concepts such as table indexes and database normalization, or at least speak with the database expert in your company. <i>Have you experienced major bottlenecks in R Shiny applications? If so, were they related to the quality of the data storage solution?</i> Please let us know how you've fixed the problem - <a href="http://twitter.com/appsilon" target="_blank" rel="noopener">@appsilon</a>. <blockquote>Want to learn more about R Shiny and make a career out of it? <a href="https://appsilon.com/how-to-start-a-career-as-an-r-shiny-developer/" target="_blank" rel="noopener">Here's everything you need to go from zero to hired</a>.</blockquote>

Have questions or insights?

Engage with experts, share ideas and take your data journey to the next level!

Stop Struggling with Outdated Clinical Data Systems

Join pharma data leaders from Jazz Pharmaceuticals and Novo Nordisk in our live podcast episode as they share what really works when building modern, compliant Statistical Computing Environments (SCEs).

Save My Spot

Is Your Software GxP Compliant?

Download a checklist designed for clinical managers in data departments to make sure that software meets requirements for FDA and EMA submissions.

Get the Checklist

Ensure Your R and Python Code Meets FDA and EMA Standards

A comprehensive diagnosis of your R and Python software and computing environment compliance with actionable recommendations and areas for improvement.

Book the Audit