John Gerhardt, Author at The Contactually Blog

Migrating Data From MongoDB to PostgreSQL with No Downtime

Photo by Gareth Davies on Unsplash

Why Migrate?

In the early days of Contactually, before the product was fully fleshed out and requirements were clear, MongoDB was chosen to house certain portions of our application's data. MongoDB has many valid use cases, but ours did not fit, and a change needed to be made. In this case, that meant migrating data from our MongoDB cluster into our relational datastore, PostgreSQL. While this migration is conceptually easy to understand, our transition would have failed if it weren't for this three-step deployment plan.

Important Not to Disrupt Regular Product Usage

As we discussed the primary goals for the migration, our number one priority was to make sure that users would suffer neither downtime nor degradation of the product. We also wanted to make sure that no changes made during the migration would be missed. Over a few whiteboarding sessions, we came up with a strategy that allowed us to accomplish both goals.

High-Level Strategy: Duplicate Writes with a Series of Deployments

So how did we do it? Our high-level strategy was to write to both systems simultaneously, then roll out a series of deployments that would eventually shift reads to the new system.

Create New PostgreSQL Schema

The first step was to review the schema (or most recent schema) of the Mongoid class we intended to migrate and come up with an appropriate PostgreSQL table schema. In some cases this wasn't too disruptive: we were on PostgreSQL 9.6.x, so JSON columns were an option when string, integer, datetime, or text columns wouldn't suffice. In many cases, this was also a chance for us to clean up the older...
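The excerpt cuts off before the schema details, but as a rough sketch of what the first two steps can look like in a Rails codebase, here is a minimal example. The model names, fields, and mirroring callback are hypothetical illustrations, not Contactually's actual code.

```ruby
# A hypothetical "notes" collection being moved out of MongoDB.

# Step 1: the new PostgreSQL table. A jsonb column acts as a catch-all for
# document fields that don't map cleanly onto scalar columns.
class CreateNotes < ActiveRecord::Migration
  def change
    create_table :notes do |t|
      t.string :mongo_id, null: false, index: { unique: true }  # link back to the source document
      t.text   :body
      t.jsonb  :metadata, null: false, default: {}
      t.timestamps null: false
    end
  end
end

class Note < ActiveRecord::Base
end

# Step 2 (first deployment): keep reading from MongoDB, but mirror every write
# into PostgreSQL so nothing that changes during the backfill is missed.
class LegacyNote
  include Mongoid::Document

  field :body,     type: String
  field :metadata, type: Hash, default: {}

  after_save :mirror_to_postgres

  private

  def mirror_to_postgres
    note = Note.find_or_initialize_by(mongo_id: id.to_s)
    note.update!(body: body, metadata: metadata)
  end
end
```

Once the backfill catches up and both stores agree, a later deployment can flip reads over to PostgreSQL and a final one can stop writing to MongoDB entirely.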

Adding Read Replicas in a Production Ruby on Rails System with Zero Downtime Using Makara

Photo by Thaddaeus Lim

When query, schema, and index optimizations aren't enough to support the load of a high-throughput database system, fundamental changes in how data flows through that system are often necessary. Over the past few years, our team at Contactually has been fortunate enough to tackle numerous database performance issues, ranging from autovacuum tuning to full production database upgrades. Growth as a company doesn't occur without these kinds of issues, and we're happy to share our experiences with our peers.

Why Splitting Your Read/Write Queries Is Necessary

Load balancing is a fundamental concept in any system design. It's important to design your data systems to handle the peaks or spikes in usage, as opposed to designing for the average load a system experiences. By splitting queries between your primary and follower databases, you ensure that users interacting with your web application have a snappy, performant experience, while heavy background jobs are routed to a follower.

How We Researched Potential Solutions

There are a few options when it comes to choosing a database adapter that supports primary/follower query balancing. After evaluating Makara and Octopus, we decided to go with Makara for the following reasons:

- Thread safety: we use Sidekiq for our background job processing, so a non-thread-safe database adapter was a deal-breaker.
- Community support and use in large production environments: we read about a few companies' experiences using Makara, namely Instacart.

In our research, we discovered a gem called distribute_reads (made by Instacart) that sealed the deal for us. It also fed directly into our ideal incremental release strategy.

Plan a Detailed Rollout Strategy

Given the importance of...
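For readers unfamiliar with the pieces mentioned above, the Makara configuration and the distribute_reads API look roughly like the sketch below. The hostnames, database name, and model are placeholders, and the exact keys should be checked against the Makara and distribute_reads READMEs rather than taken from here.

```yaml
# config/database.yml (sketch, following the format in the Makara README)
production:
  adapter: postgresql_makara
  database: myapp_production
  makara:
    sticky: true            # keep a client on the primary briefly after it writes
    connections:
      - role: master        # the primary; all writes go here
        host: primary.db.internal
      - role: slave         # read replica(s) for balanced reads
        host: replica1.db.internal
```

```ruby
# With the distribute_reads gem, replica reads are opt-in per block, which is
# what makes an incremental rollout (job by job, endpoint by endpoint) possible.
distribute_reads do
  Contact.where(user_id: user.id).count  # routed to a replica
end
```

Because only code wrapped in distribute_reads touches a follower, heavy background jobs can be migrated one at a time while everything else keeps hitting the primary.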

Tackling Architectural Debt: How We Replaced a Production Elasticsearch Cluster

Photo by Alice Pasqual

As the quantity and complexity of application data scales with any burgeoning startup, we all run into performance and scalability issues. Some issues can be addressed with slight adjustments to existing infrastructure, while others require fundamental changes to the system's architecture. At Contactually, one such instance was an unhealthy Elasticsearch cluster that suffered from availability issues, syncing issues, and fundamental inefficiencies in how data flowed through our systems. This article focuses primarily on the large architectural change we made at Contactually in June of 2016 to reinforce an unreliable Elasticsearch cluster.

Why We Use Elasticsearch at Contactually

Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected. At Contactually, we use Elasticsearch to let our users quickly filter their contacts by numerous search criteria, such as name, email address, mailing address, phone number, zip code, and even custom fields the user has created. Speed is key here: many of our users have hundreds of thousands of contacts, and we store billions of data points in our normalized, relational data model. I won't go into the pros and cons of normalization in this article, but they are very important to consider when designing any high-throughput system.

Denormalized Document Storage

Imagine trying to find a record in your database by attributes that could live in any of 10 tables. This might require an extremely expensive join, especially if the scale of the data you're working with is...
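To make the denormalized-document idea concrete, here is an illustrative sketch (not our actual index mapping or fields) of indexing a contact as one flat document and filtering it with a single query, using the elasticsearch-ruby client:

```ruby
require "elasticsearch"

client = Elasticsearch::Client.new(url: ENV["ELASTICSEARCH_URL"])

# One denormalized document per contact: attributes that live across many
# relational tables are flattened into a single searchable document.
client.index(
  index: "contacts",
  id: 123,
  body: {
    user_id:       42,
    name:          "Jane Doe",
    emails:        ["jane@example.com"],
    phone_numbers: ["+1-555-0100"],
    zip_codes:     ["20001"],
    custom_fields: { "Anniversary" => "2015-06-01" }
  }
)

# Filtering by several attributes is then a single query against one index,
# with no joins required.
client.search(
  index: "contacts",
  body: {
    query: {
      bool: {
        filter: [{ term: { user_id: 42 } }],
        must:   [{ match: { name: "jane" } }, { match: { zip_codes: "20001" } }]
      }
    }
  }
)
```

The trade-off is keeping these documents in sync with the relational source of truth, which is exactly where the syncing issues mentioned above come from.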

Postgres at Scale: Query Performance and Autovacuuming for Large Tables

Photo by Yu-chuan Hsu

There are a few large, hard-to-solve problems that keep your typical software engineer up at night, and a malfunctioning database is absolutely one of them. We'll walk you through how our team at Contactually discovered query performance slipping on some of our largest tables (200 million to 3 billion+ records) and what we did to address it.

The Problem: Systematic Degraded Query Performance

In early 2016, we noticed in New Relic that query performance on our largest tables was starting to slip. Over the course of a few weeks, queries that had once taken 10 ms were taking upwards of 6 seconds or more. The team considered numerous causes, from poorly written queries to expensive background jobs that might be causing systematic performance issues. Eventually, with the aid of New Relic, Amazon RDS logs, and hundreds of EXPLAIN queries, we concluded that this was a database-level issue, not poorly written queries in our application (a very common cause of poor database performance).

The Culprit: Default Autovacuum Settings

After days of research, we narrowed in on the cause: the default autovacuum settings for large tables.

Photo by Clem Onojeghuo

The default settings can let weeks elapse without an autovacuum process running on a large table. When autovacuuming doesn't occur, the query planner uses outdated, incorrect statistics to decide how to execute a query most efficiently. Imagine trying to tell a friend where the milk is in a grocery store using directions based on what the store looked like 5 years ago: your friend is going to waste a lot of time following your outdated directions.

Critical...
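For context on why the defaults fall over at this scale: autovacuum only runs on a table once its dead tuples exceed autovacuum_vacuum_threshold (default 50) plus autovacuum_vacuum_scale_factor (default 0.2) times the table's row count, so a 3-billion-row table accumulates roughly 600 million dead rows before it is vacuumed. A common fix is to override these storage parameters per table; the sketch below does so from a Rails migration, with an illustrative table name and thresholds rather than our actual production values.

```ruby
# Sketch: per-table autovacuum overrides applied from a Rails migration.
class TuneAutovacuumOnContacts < ActiveRecord::Migration
  def up
    execute <<-SQL
      ALTER TABLE contacts SET (
        autovacuum_vacuum_scale_factor  = 0.0,     -- ignore the percentage-based trigger
        autovacuum_vacuum_threshold     = 100000,  -- vacuum after ~100k dead tuples instead of ~0.2 * row count
        autovacuum_analyze_scale_factor = 0.0,
        autovacuum_analyze_threshold    = 100000   -- keep planner statistics fresh too
      );
    SQL
  end

  def down
    execute <<-SQL
      ALTER TABLE contacts RESET (
        autovacuum_vacuum_scale_factor,
        autovacuum_vacuum_threshold,
        autovacuum_analyze_scale_factor,
        autovacuum_analyze_threshold
      );
    SQL
  end
end
```

Fixed thresholds like these make autovacuum and autoanalyze run on a schedule driven by write volume rather than by a fraction of an enormous table, which keeps the planner's statistics current.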