Search

Top 60 Oracle Blogs

Recent comments

Google Spanner – SQL compatibility

By Franck Pachot

.
I have posted, a long time ago, about Google Spanner (inserting data and no decimal numeric data types) but many things have changed in this area. There is now a NUMERIC data type and many things have improved in this distributed SQL database, improving a bit the SQL compatibility.

gcloud

I can use the Cloud Shell, which is very easy – one click fron the console – but here I’m showing how to install the gcloud CLI temporarily in a OEL environement (I’ve explained in a previous post how the OCI free tier is my preferred home)


curl https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-... | tar -C /var/tmp -zxf -
/var/tmp/google-cloud-sdk/install.sh --usage-reporting true --screen-reader false --rc-path /tmp/gcloud.tmp --command-completion true --path-update true --override-components

This downloads and unzips the Google Cloud SDK for Linux (there are other options like a YUM repo). I put it in a temporary directory here under /var/tmp. install.sh is interactive or you can supply all information on command line. I don’t want it to update my .bash_profile or .bashrc but want to see what it puts there, so just providing a temporary /tmp/gcloud.sh to have a lookt at it


[opc@a ~]$ /var/tmp/google-cloud-sdk/install.sh --usage-reporting true --screen-reader false --rc-path /tmp/gcloud.sh --command-completion true --path-update true --override-components

Welcome to the Google Cloud SDK!

Your current Cloud SDK version is: 321.0.0
The latest available version is: 321.0.0

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                 Components                                                 │
├───────────────┬──────────────────────────────────────────────────────┬──────────────────────────┬──────────┤
│     Status    │                         Name                         │            ID            │   Size   │
├───────────────┼──────────────────────────────────────────────────────┼──────────────────────────┼──────────┤
...
│ Installed     │ BigQuery Command Line Tool                           │ bq                       │  < 1 MiB │
│ Installed     │ Cloud SDK Core Libraries                             │ core                     │ 15.9 MiB │
│ Installed     │ Cloud Storage Command Line Tool                      │ gsutil                   │  3.5 MiB │
└───────────────┴──────────────────────────────────────────────────────┴──────────────────────────┴──────────┘

[opc@a ~]$ cat /tmp/gcloud.tmp

# The next line updates PATH for the Google Cloud SDK.
if [ -f '/var/tmp/google-cloud-sdk/path.bash.inc' ]; then . '/var/tmp/google-cloud-sdk/path.bash.inc'; fi

# The next line enables shell command completion for gcloud.
if [ -f '/var/tmp/google-cloud-sdk/completion.bash.inc' ]; then . '/var/tmp/google-cloud-sdk/completion.bash.inc'; fi

As you can see, I mentioned nothing for –override-components and the default is gsutil, core and bq

https://blog.dbi-services.com/wp-insides/uploads/sites/2/2020/12/Screens... 293w" sizes="(max-width: 180px) 100vw, 180px" />
This is the most simple cloud CLI I ever seen. Just type: gcloud init
(/var/tmp/google-cloud-sdk/bin/gcloud in my temporary installation) and it gives you an URL where you can get the verification code, using the web console authentication. Then you pick the cloud project and, optionally, a default region and zone. Those default informations are stored in: ~/.config/gcloud/configurations/config_default
and the credentials are in a sqllite database in ~/.config/gcloud/credentials.db

Sample SQL data

Gerald Venzl has recently published some free to use data about the world’s countries, capitals, and currencies. What I like with this data set is that, in addition to providing the CSV, Gerald have managed to provide a unique DDL + DML to create this data in SQL databases. And this works in the most common databases despite the fact that, beyond the SQL standard, data types and syntax is different in each engine.
https://twitter.com/GeraldVenzl/status/1338327177788411905?s=20

Google Spanner has a SQL-like API but I cannot run this without a few changes. But just a few, thanks to many recent improvements. And loading this data set will be the occasion to show those new features.

Primary key

Here is the DDL to create the first table, REGIONS:


/*********************************************/
/***************** REGIONS *******************/
/*********************************************/
CREATE TABLE regions
(
  region_id     VARCHAR(2)   NOT NULL,
  name          VARCHAR(13)  NOT NULL,
  CONSTRAINT regions_pk
    PRIMARY KEY (region_id)
);

In order to run this in Google Spanner, I change VARCHAR to STRING and I move the PRIMARY KEY declaration out of the relational properties:


CREATE TABLE regions
(
  region_id     STRING(2)   NOT NULL,
  name          STRING(13)  NOT NULL
)
    PRIMARY KEY (region_id)
;

It may be surprising to have the PRIMARY KEY declaration at this place but, because Google Spanner is a distributed database, the primary key is also a storage attribute. And it is mandatory as sharding is done by range partitioning on the primary key. Well, I would prefer to stay compatible with SQL and have, if needed, an additional organization clause. But Spanner is one of the first database trying to bring SQL to NoSQL and was originally designed to be used internally (like Google object storage matadata), providing SQL database benefits (SQL, ACID, joins,…) with the same scalability as distributed NoSQL databases. So compatibility with other SQL databases was probably not a priority.

Note that the NOT NULL constraint is allowed and we will see more about columns constraints later.

I have removed the comments because this is not allowed in Google Spanner. I don’t understand that, but remember that it takes its roots in NoSQL where the API calls are embedded in the code, and not in scripts, and the code has its own comments.

Foreign key

The second table is COUNTRIES and a country belongs to a region from the REGION table.


/*********************************************/
/**************** COUNTRIES ******************/
/*********************************************/
CREATE TABLE countries
(
  country_id    VARCHAR(3)     NOT NULL,
  country_code  VARCHAR(2)     NOT NULL,
  name          VARCHAR(100)   NOT NULL,
  official_name VARCHAR(200),
  population    NUMERIC(10),
  area_sq_km    NUMERIC(10,2),
  latitude      NUMERIC(8,5),
  longitude     NUMERIC(8,5),
  timezone      VARCHAR(40),
  region_id     VARCHAR(2)     NOT NULL,
  CONSTRAINT countries_pk
    PRIMARY KEY (country_id),
  CONSTRAINT countries_regions_fk001
    FOREIGN KEY (region_id) REFERENCES regions (region_id)
);

CREATE INDEX countries_regions_fk001 ON countries (region_id);

On the datatypes, I’ll change VARCHAR to STRING and now we have a NUMERIC datatype in Spanner (was only IEEE 754 float) but NUMERIC has a fixed scale and precision (38,9) I’ll probably come back to it in another post. But there are still many limitations with NUMERIC (cannot create index on it, not easy to map when importing,…). It is definitely not a good choice here for areas, latitude and longitude. But I keep it just for DDL compatibility.

I move the PRIMARY KEY definition. But the most important here is actually about not changing the referential constraint at all:


CREATE TABLE countries
(
  country_id    STRING(3)     NOT NULL,
  country_code  STRING(2)     NOT NULL,
  name          STRING(100)   NOT NULL,
  official_name STRING(200),
  population    NUMERIC,
  area_sq_km    NUMERIC,
  latitude      NUMERIC,
  longitude     NUMERIC,
  timezone      STRING(40),
  region_id     STRING(2)     NOT NULL,
  CONSTRAINT countries_regions_fk001
    FOREIGN KEY (region_id) REFERENCES regions (region_id)
)
PRIMARY KEY (country_id)
;

Before March 2020 The only possible referential integrity way actually a storage clause (at the same place as the PRIMARY KEY declaration) because referential integrity is not something easy in a distributed database. Because you distribute to scale and that works well only when your transaction is single-shard. This is why, initially, referential integrity was not a FOREIGN KEY but a compound PRIMARY KEY interleaved with the parent PRIMARY KEY. But we will see INTERLEAVE IN PARENT later. Here, COUTRIES has its own primary key, and then its own sharding scheme. That’ also mean that inserting a new country may have to check the parent key (region) in another shard. We will look at performance in another post.

When we declare a foreign key in Google Spanner, an index on it is created, then I didn’t copy the CREATE INDEX statement.

Check constraints


/*********************************************/
/***************** CITIES ********************/
/*********************************************/

CREATE TABLE cities
(
  city_id       VARCHAR(7)    NOT NULL,
  name          VARCHAR(100)  NOT NULL,
  official_name VARCHAR(200),
  population    NUMERIC(8),
  is_capital    CHAR(1)       DEFAULT 'N' NOT NULL,
  latitude      NUMERIC(8,5),
  longitude     NUMERIC(8,5),
  timezone      VARCHAR(40),
  country_id    VARCHAR(3)    NOT NULL,
  CONSTRAINT cities_pk
    PRIMARY KEY (city_id),
  CONSTRAINT cities_countries_fk001
    FOREIGN KEY (country_id) REFERENCES countries (country_id),
  CONSTRAINT cities_is_capital_Y_N_check001
    CHECK (is_capital IN ('Y','N'))
);

CREATE INDEX cities_countries_fk001 ON cities (country_id);

In addition to the datatypes we have seen earlier I’ll transform CHAR to STRING, but I have to remove the DEFAULT declaration. Spanner recently introduced generated columns but those are always calculated, they cannot substitute to DEFAULT.
We declare check constraints since Oct. 2020 and I keep the same declaration. As far as I know they are not used by the optimizer but only to validate the data ingested.

I have a foreign key to COUNTRIES which I keep as non-interleaved. Because the COUNTRY_ID is not part of the CITIES primary key, and maybe because my data model may have to cope with cities changing to another country (geopolitical immutability).


CREATE TABLE cities
(
  city_id       STRING(7)    NOT NULL,
  name          STRING(100)  NOT NULL,
  official_name STRING(200),
  population    NUMERIC,
  is_capital    STRING(1)       NOT NULL,
  latitude      NUMERIC,
  longitude     NUMERIC,
  timezone      STRING(40),
  country_id    STRING(3)    NOT NULL,
  CONSTRAINT cities_countries_fk001
    FOREIGN KEY (country_id) REFERENCES countries (country_id),
  CONSTRAINT cities_is_capital_Y_N_check001
    CHECK (is_capital IN ('Y','N'))
)
    PRIMARY KEY (city_id)
;

Again, the index on COUNTRY_ID is created implicitely.

Interleaved


/*********************************************/
/***************** CURRENCIES ****************/
/*********************************************/
CREATE TABLE currencies
(
  currency_id       VARCHAR(3)    NOT NULL,
  name              VARCHAR(50)   NOT NULL,
  official_name     VARCHAR(200),
  symbol            VARCHAR(18)   NOT NULL,
  CONSTRAINT currencies_pk
    PRIMARY KEY (currency_id)
);
/*********************************************/
/*********** CURRENCIES_COUNTRIES ************/
/*********************************************/
CREATE TABLE currencies_countries
(
  currency_id    VARCHAR(3)   NOT NULL,
  country_id     VARCHAR(3)   NOT NULL,
  CONSTRAINT currencies_countries_pk
    PRIMARY KEY (currency_id, country_id),
  CONSTRAINT currencies_countries_currencies_fk001
    FOREIGN KEY (currency_id) REFERENCES currencies (currency_id),
  CONSTRAINT currencies_countries_countries_fk002
    FOREIGN KEY (country_id)  REFERENCES countries(country_id)
);

The CURRENCIES_COUNTRIES is the implementation of many-to-many relationship as a country may have multiple currencies and a currency can be used in multiple country. The primary key is a concatenation of the foreign keys. Here I’ll decide that referential integrity is a bit stronger and the list of currencies will be stored in each country, like storing it pre-joined for performance reason. If you come from Oracle, you may see this INTERLEAVE IN PARENT as a table CLUSTER but on a partitioned table. If you come from DynamoDB you may see the of it as the Adjacency lists modeling in the single-table design.


CREATE TABLE currencies
(
  currency_id       STRING(3)    NOT NULL,
  name              STRING(50)   NOT NULL,
  official_name     STRING(200),
  symbol            STRING(18)   NOT NULL,
)
    PRIMARY KEY (currency_id)
;

CREATE TABLE currencies_countries
(
  currency_id    STRING(3)   NOT NULL,
  country_id     STRING(3)   NOT NULL,
  CONSTRAINT currencies_countries_countries_fk002
    FOREIGN KEY (country_id)  REFERENCES countries(country_id)
)
PRIMARY KEY (currency_id, country_id)
,  INTERLEAVE IN PARENT currencies ON DELETE CASCADE;

Here, the many-to-many association between CURRENCIES and COUNTRIES is materialized as a composition with CURRENCIES, stored with it (INTERLEAVE), and removed with it (ON DELETE CASCADE). Note that I did that for the demo because CURRENCY_ID was first in the primary key declaration, but you may think more about data distribution and lifecycle when deciding on interleaving. Google Spanner partitions by range, and this means that all CURRENCIES_COUNTRIES associations will be stored together, in the same shard (called “split” in Spanner) for the same CURRENCY_ID.

Multi-region instance

There are many new multi-region configurations, within the same continent or distributed over multiple ones:


gcloud spanner instance-configs list

NAME                              DISPLAY_NAME
asia1                             Asia (Tokyo/Osaka/Seoul)
eur3                              Europe (Belgium/Netherlands)
eur5                              Europe (London/Belgium/Netherlands)
eur6                              Europe (Netherlands, Frankfurt)
nam-eur-asia1                     United States, Europe, and Asia (Iowa/Oklahoma/Belgium/Taiwan)
nam10                             United States (Iowa/Salt Lake/Oklahoma)
nam11                             United States (Iowa, South Carolina, Oklahoma)
nam3                              United States (Northern Virginia/South Carolina)
nam6                              United States (Iowa/South Carolina/Oregon/Los Angeles)
nam7                              United States (Iowa, Northern Virginia, Oklahoma)
nam8                              United States (Los Angeles, Oregon, Salt Lake City)
nam9                              United States (Northern Virginia, Iowa, South Carolina, Oregon)
regional-asia-east1               asia-east1
regional-asia-east2               asia-east2
regional-asia-northeast1          asia-northeast1
regional-asia-northeast2          asia-northeast2
regional-asia-northeast3          asia-northeast3
regional-asia-south1              asia-south1
regional-asia-southeast1          asia-southeast1
regional-asia-southeast2          asia-southeast2
regional-australia-southeast1     australia-southeast1
regional-europe-north1            europe-north1
regional-europe-west1             europe-west1
regional-europe-west2             europe-west2
regional-europe-west3             europe-west3
regional-europe-west4             europe-west4
regional-europe-west6             europe-west6
regional-northamerica-northeast1  northamerica-northeast1
regional-southamerica-east1       southamerica-east1
regional-us-central1              us-central1
regional-us-east1                 us-east1
regional-us-east4                 us-east4
regional-us-west1                 us-west1
regional-us-west2                 us-west2
regional-us-west3                 us-west3
regional-us-west4                 us-west4

I’ll use the latest dual-region in my continent, eur6, added on Dec. 2020 which has two read-write regions (Netherlands, Frankfurt) and the witness region in Zurich. Yes, this neutral position of Switzerland is a perfect fit in the distributed quorum, isn’t it? </p />
</p></div>

    	  	<div class=