Importing geo-partitioned data… the easy way


setting the stage

I started at Cockroach Labs back in June 2019 to help others learn how to architect and develop applications using a geo-distributed database.  There has been a resurgence in distributed database technology, but the focus on geo-distribution is quite unique to CockroachDB.  While the underlying technology is unique, developers and DBAs who come with a wealth of experience need to know how to best use this innovative technology.  Given this situation, I thought it would be good to start a blog series exploring various topics facing anyone beginning to architect database solutions with CockroachDB.

To start using a database, the first step is to IMPORT table data so you can begin to see how the database performs and responds.  And thus the IMPORT series has started!

Not surprisingly, the bulk of my personal experience with databases comes from work done with Oracle.  To that point, I was working with a customer to show them how to import Oracle data.  I extracted data from the Oracle SwingBench benchmark and created CSV files using the examples in the CockroachDB documentation.  This dataset can be easily replicated using the wonderful toolkit created by Dominic Giles.

import methodology

There are several ways to import data with CockroachDB.  With the IMPORT INTO command, you can import CSV data into an existing table.  The IMPORT TABLE command imports an entire table from CSV files in one step, creating the primary key and secondary indexes in parallel.  For this example, the entire table will be imported via CSV files.
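As a quick sketch of the first approach (the table name, column list, and file URL below are illustrative, not the actual benchmark schema), an IMPORT INTO against an existing table looks like:

```sql
-- Hypothetical example: import CSV rows into an existing table.
-- The column list and HTTP location are assumptions for illustration.
IMPORT INTO customers (customer_id, first_name, last_name)
    CSV DATA ('http://fileserver:8080/importdir/cust01.csv');
```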

As you might expect with a distributed database, there needs to be an efficient way for each of the nodes to access the CSV files.  CockroachDB supports accessing files stored in cloud storage buckets such as Amazon S3, Azure, and Google Cloud.  Additionally, CockroachDB allows data to be imported from various other URL data services: HTTP, NFS, and S3-compatible services.  This blog explores the HTTP and NFS/local import methods with the following commands:
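For example, an HTTP-based IMPORT TABLE might look like the following sketch (the server address, file names, and simplified three-column schema are assumptions for illustration):

```sql
-- Hypothetical example: create and load a table in one step from
-- gzip-compressed CSV files served over HTTP.
IMPORT TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name  STRING,
    last_name   STRING
) CSV DATA (
    'http://fileserver:8080/importdir/cust01.csv.gz',
    'http://fileserver:8080/importdir/cust02.csv.gz'
);
```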

and using NFS/nodelocal:
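A sketch of an import using the nodelocal URL method (again with an illustrative schema and file names):

```sql
-- Hypothetical example: load from files under the directory passed
-- to --external-io-dir on each node.
IMPORT TABLE customers (
    customer_id INT PRIMARY KEY,
    first_name  STRING,
    last_name   STRING
) CSV DATA (
    'nodelocal:///importdir/cust01.csv.gz',
    'nodelocal:///importdir/cust02.csv.gz'
);
```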

To support NFS mounts, CockroachDB uses the --external-io-dir option when starting each node in the cluster.  This option points to the NFS mount directory, allowing each node to use the nodelocal URL to define the location of the import files.
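A sketch of a node start command with this flag (the store path, mount point, and join addresses are illustrative):

```shell
# Hypothetical example: each node mounts the shared NFS export at
# /mnt/nfs/import and points --external-io-dir at it.
cockroach start \
  --store=/data/cockroach \
  --external-io-dir=/mnt/nfs/import \
  --join=node1:26257,node2:26257,node3:26257
```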

geo-distributed vs local clusters

The clusters configured for this blog use the Google Cloud n1-standard-4 machine type.  Using these machines, two clusters were set up:

    1. Local Cluster with 3 nodes in a single region
    2. Geo-partitioned Cluster with 3 nodes across:
          'us-west1-b', 'us-central1-b', 'us-east1-b'

Local clusters give superb availability and scalability within a single region.  We have many customers that have moved from Postgres to CockroachDB for a more resilient and scalable solution.  Often, they run local clusters for test and some production workloads while they explore how to best use geo-partitioning.

measuring the throughput

CockroachDB has the ability to ingest raw or compressed CSV files.  There are good reasons to load either file type, each with its own implications.  Typically, you would expect more CPU overhead to decompress compressed files, but less time transferring data due to the smaller payload.

To house the CSV data, an HTTP server was configured on a separate machine.  Additionally, to simulate an optimal NFS environment where all data was locally cached, a copy of all the files was placed in the /tmp/importdir directory on each of the nodes.  Finally, the cluster was started with the flag --external-io-dir=/tmp.  This allowed data to be loaded with the nodelocal URL method.


Data locality is important to achieving the best results.  The speed of light and network lag are very much in play.  That said, the geo-distributed cluster takes only a small ~9% hit in throughput versus the local cluster.

If you don’t have a real NFS setup, you can achieve excellent results as well by loading the data onto one of the nodes and using a node-specific URL such as nodelocal://1/importdir/cust01.csv.gz to load from that node.


These tests were re-run while hosting the files on only one of the cluster nodes.  CockroachDB was able to achieve more than 95% of the throughput of a true optimal NFS configuration using nodelocal.


CockroachDB provides multiple cloud and URL methods to import data.  Optimized loading of geo-partitioned data allows developers to focus on application optimization without worrying about how the data will be loaded.

I plan to expand on loading and data ingestion with future posts, so please let me know if you have any suggestions or feedback.