## Who's online

There are currently 0 users and 29 guests online.

# Frequency Histogram 4

In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.

The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:

• Take the first six bytes of the string (after padding it to 20 characters with nulls (varchar) or spaces (char))
• View this as a hexadecimal number, and convert to decimal
• Round to 15 significant digits and store as the endpoint_value
• If duplicate rows appear, store the first 32 bytes of each string as the endpoint_actual_value

Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).

With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):

```
rem
rem     How to read a frequency histogram on a character column
rem

select
endpoint_number,
endpoint_number - nvl(prev_endpoint,0)  frequency,
hex_val,
chr(to_number(substr(hex_val, 2,2),'XX')) ||
chr(to_number(substr(hex_val, 4,2),'XX')) ||
chr(to_number(substr(hex_val, 6,2),'XX')) ||
chr(to_number(substr(hex_val, 8,2),'XX')) ||
chr(to_number(substr(hex_val,10,2),'XX')) ||
chr(to_number(substr(hex_val,12,2),'XX')),
endpoint_actual_value
from    (
select
endpoint_number,
lag(endpoint_number,1) over(
order by endpoint_number
)                                                       prev_endpoint,
to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
endpoint_actual_value
from
dba_tab_histograms
where
owner = 'XXX'
and     table_name = 'YYY'
and     column_name = 'STATUS_COLUMN'
)
order by
endpoint_number
;

set doc off
doc

ENDPOINT_NUMBER  FREQUENCY HEX_VAL                         CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
40254      40254  434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
40467        213  434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
40592        125  44454C49564550D642CA2965000000 DELIVE DELIVERED
41304        712  494E564F49432991BF41C99E800000 INVOIC INVOICED
41336         32  4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
41435          1  5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
41478         43  5049434B4544013F0FF93F6EC00000 PICKED PICKED
41479          1  524546554E4436441DE2A321000000 REFUND REFUND MADE
41480          1  524546554E4436441DE2A321000000 REFUND REFUND PENDING
41482          2  52455455524E2F6693F753B6C00000 RETURN RETURNED

11 rows selected.

#
```

You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.

There are a number of side effects to the 6 bytes / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.

Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.