Pity us poor US DBAs, safe and secure with our ancient, many-times-upgraded Oracle 6 databases and their US7ASCII character sets.
We knew that ASCII only covered 0-127, but who could blame us when we started putting WE8MSWIN1252 "international" characters into those fields — the database let us, and it felt kind of sexy putting in cool European characters with umlauts and accents on them.
Besides, all of our business was with other American companies, and if someone had some "funny" characters in their name, then they just had to change them!
Of course, all of this is said with tongue firmly planted in cheek. Nowadays you’d better be able to handle Unicode in your database if you want to have a prayer of not being labeled as something older than teleprinters and typewriters.
I first encountered this situation while working with a US7ASCII database where we had started using XMLTYPE columns. Little did I know that XMLTYPE columns actually validate the character set of the incoming XML document. One of our fields was the country name.
Everything was fine until February 13th, 2004, the day ISO added an entry for the Åland Islands (whose name begins with an A topped by a ring diacritic: Å).
We started seeing errors inserting our XML documents, all due to strict validation of the character set. Did we change character sets? No; we stopped using the XMLTYPE columns instead.
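The failure mode is easy to reproduce outside Oracle. As a rough sketch of the same kind of strict validation (using Python's ASCII codec as a stand-in for the database's character-set check, not Oracle's actual code path):

```python
# Å (U+00C5) has no representation in 7-bit ASCII (0-127), so a
# strict ASCII encoder rejects the whole string outright -- the same
# class of error the XMLTYPE character-set validation raised.
country = "\u00c5land Islands"

try:
    country.encode("ascii")
    print("accepted")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)  # 'ordinal not in range(128)'
```

Any byte value above 127 that we had been quietly stuffing into US7ASCII columns fails the same check.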
Fast forward a few years, and now I'm lucky enough to work with proper databases created with the AL32UTF8 character set, so now I can store my friend Mogens Noorgard's name correctly (or I would if I could spell it…)
However, little did I realize that I needed to declare my columns differently…
You see, back in the day, VARCHAR2(10) meant that I wanted to store up to 10 characters in the column gosh darn it — I didn’t worry about bytes vs. characters — same thing right?
So in a brand new database with the AL32UTF8 character set, why was I getting column length errors trying to insert the string "HUÿ" (H, U, "y with an umlaut") into a VARCHAR2(3) field?
Heck, isn’t “y with an umlaut” just another character? It’s just WE8MSWIN1252 character 255, right?
Don’t tell me it’s a character set issue — I’ve been trying to avoid opening up that NLS manual for years…
Ok, ok — open the manual and start reading about Unicode — specifically UTF-8. Uh-oh, I read the words "variable-length encoding" and the light starts to dawn…
Turns out that “y with an umlaut” (ÿ) is 1 byte in WE8MSWIN1252 (specifically 0xFF), but it’s 2 bytes in UTF-8 (0xC3 0xBF).
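You can verify those byte values yourself; a quick Python check (cp1252 is Python's name for the Windows-1252 code page that WE8MSWIN1252 corresponds to):

```python
# ÿ (LATIN SMALL LETTER Y WITH DIAERESIS, U+00FF) is one byte in
# Windows-1252 but two bytes once encoded as UTF-8.
y_umlaut = "\u00ff"

in_1252 = y_umlaut.encode("cp1252")
in_utf8 = y_umlaut.encode("utf-8")

print(in_1252.hex(), len(in_1252))  # ff 1
print(in_utf8.hex(), len(in_utf8))  # c3bf 2
```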
But didn’t I declare the column to be 3 characters in length? So why does it care about the underlying encoding?
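That mismatch is the whole story behind the failed insert. Sketching it in Python, under the assumption that the column limit is being applied to the encoded bytes rather than the characters:

```python
# The string that blew up the VARCHAR2(3) insert: 3 characters,
# but 4 bytes once encoded as UTF-8 (ÿ expands to two bytes).
s = "HU\u00ff"

print(len(s))                  # 3 characters
print(len(s.encode("utf-8")))  # 4 bytes -- over a 3-byte limit
```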
Enter NLS_LENGTH_SEMANTICS and the fact that its default is BYTE.
From the documentation:
By default, the character data types CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

This works well if the database character set uses a single-byte character encoding scheme because the number of characters is the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, then the number of bytes no longer equals the number of characters because a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters. You can overcome this problem by switching to character semantics when defining the column size.

NLS_LENGTH_SEMANTICS enables you to create CHAR and VARCHAR2 columns using either byte or character length semantics. NCHAR, NVARCHAR2, CLOB, and NCLOB columns are always character-based. Existing columns are not affected.

You may be required to use byte semantics in order to maintain compatibility with existing applications.

NLS_LENGTH_SEMANTICS does not apply to tables in SYS and SYSTEM. The data dictionary always uses byte semantics.

Note that if the NLS_LENGTH_SEMANTICS environment variable is not set on the client, then the client session defaults to the value for NLS_LENGTH_SEMANTICS on the database server. This enables all client sessions on the network to have the same NLS_LENGTH_SEMANTICS behavior. Setting the environment variable on an individual client enables the server initialization parameter to be overridden for that client.
Can anyone tell me why the default would be BYTE? Why would I want to declare character fields with BYTE lengths? Thank goodness it’s not in bits…
Anyway, we have since adjusted our standard so that DDL always specifies BYTE or CHAR explicitly in the declaration:
VARCHAR2(10 CHAR) instead of VARCHAR2(10), so now we can be sure…
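For a sanity check before a bulk load, the two semantics can be modeled with a small hypothetical helper (`fits` is my own name, not an Oracle API; it assumes an AL32UTF8 database and ignores Oracle's overall column byte ceilings):

```python
# Hypothetical helper: would `value` fit a VARCHAR2(limit) column
# under BYTE vs CHAR length semantics in an AL32UTF8 database?
def fits(value: str, limit: int, semantics: str = "BYTE") -> bool:
    if semantics.upper() == "CHAR":
        return len(value) <= limit          # count characters
    return len(value.encode("utf-8")) <= limit  # count encoded bytes

print(fits("HU\u00ff", 3, "BYTE"))  # False: 4 bytes
print(fits("HU\u00ff", 3, "CHAR"))  # True: 3 characters
```

With CHAR semantics spelled out in the DDL, "10 characters" finally means 10 characters, whatever the encoding does underneath.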