In an earlier note on interpreting the content of frequency histograms I made a throwaway comment about the extra complexity of interpreting frequency histograms on character-based columns. This note starts to examine some of the complications.
The driving problem behind character columns is that they can get quite large – up to 4,000 bytes – so the content of an “accurate histogram” could become quite large, and Oracle seems to have taken a strategic decision (at some point in history) to minimise this storage. As a result we can see an algorithm that works roughly as follows:
Given this algorithm, we can do an approximate reversal (which will only be needed when the endpoint_actual_value is not available) by formatting the endpoint_value into a hex string, extracting the first six pairs of digits, converting to numeric and applying the chr() function to get a character value. (You’ll have to fiddle with this bit of code to handle multibyte character sets, of course).
With a nice friendly single-byte character code, the first 5 characters will be extracted correctly, and the sixth will be pretty close to the original. Here’s an example (which also includes the logic to convert the endpoint_number into a frequency):
rem
rem How to read a frequency histogram on a character column
rem
select
endpoint_number,
endpoint_number - nvl(prev_endpoint,0) frequency,
hex_val,
chr(to_number(substr(hex_val, 2,2),'XX')) ||
chr(to_number(substr(hex_val, 4,2),'XX')) ||
chr(to_number(substr(hex_val, 6,2),'XX')) ||
chr(to_number(substr(hex_val, 8,2),'XX')) ||
chr(to_number(substr(hex_val,10,2),'XX')) ||
chr(to_number(substr(hex_val,12,2),'XX')),
endpoint_actual_value
from (
select
endpoint_number,
lag(endpoint_number,1) over(
order by endpoint_number
) prev_endpoint,
to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')hex_val,
endpoint_actual_value
from
dba_tab_histograms
where
owner = 'XXX'
and table_name = 'YYY'
and column_name = 'STATUS_COLUMN'
)
order by
endpoint_number
;
set doc off
doc
ENDPOINT_NUMBER FREQUENCY HEX_VAL CHR(TO ENDPOINT_ACTUAL_VALUE
--------------- ---------- ------------------------------- ------ ------------------------------------------
40254 40254 434C4F534543E9175A7D6A7DC00000 CLOSEC CLOSED
40467 213 434F4E4649524E7E0D374A58200000 CONFIR CONFIRMED
40592 125 44454C49564550D642CA2965000000 DELIVE DELIVERED
41304 712 494E564F49432991BF41C99E800000 INVOIC INVOICED
41336 32 4E4556FFFFFFF1D5FBDBC624E00000 NEVÿÿÿ NEW
41434 98 5041494400000C08C1A415AD800000 PAID PAID
41435 1 5041594D454E5B08040F761BE00000 PAYMEN PAYMENT OVERDUE
41478 43 5049434B4544013F0FF93F6EC00000 PICKED PICKED
41479 1 524546554E4436441DE2A321000000 REFUND REFUND MADE
41480 1 524546554E4436441DE2A321000000 REFUND REFUND PENDING
41482 2 52455455524E2F6693F753B6C00000 RETURN RETURNED
11 rows selected.
#
You’ll notice from the sample output that “REFUND MADE” and “REFUND PENDING” are identical in their numeric representation, and that’s why all the actual values have been stored. You can also see how rounding problems have converted CLOSED to CLOSEC, and the padding applied to short strings (combined with rounding errors) has converted NEW to NEVÿÿÿ.
There are a number of side effects to the 6 bytes / 32 character limits that Oracle has imposed for histograms – and I’ll pick up a couple of those in further posts.
Footnote: It’s interesting to note that space utilisation isn’t considered a threat in 11g when looking at the ‘synopsis’ approach of creating the ‘approximate NDV’ for columns. The difference may be due to the passage of time, of course, on the other hand the threat from synopses is largely limited to disc space whereas histograms have to take up memory (in the dictionary cache / row cache) whenever they are used.
Recent comments
17 weeks 20 hours ago
26 weeks 6 days ago
28 weeks 4 days ago
31 weeks 5 days ago
34 weeks 8 hours ago
43 weeks 4 days ago
45 weeks 22 hours ago
46 weeks 23 hours ago
46 weeks 2 days ago
49 weeks 3 hours ago