Vertica database servers expect to receive all data in UTF-8 and Vertica outputs all data in UTF-8. However, you can load or insert data into Vertica that is non UTF-8, but you’ll want to clean it up.
I used to recommend the REGEXP_REPLACE function for that task, but now there is a better way!
Vertica 10.1.x introduces the MAKEUTF8 built-in function that coerces a string to UTF-8 by removing or replacing non-UTF-8 characters.
The old way of removing non-UTF-8 characters:
dbadmin=> \d nonutf_test;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------------+--------+--------------+------+---------+----------+-------------+-------------
public | nonutf_test | c | varchar(100) | 100 | | f | f |
(1 row)
dbadmin=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0';
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT c, REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b') FROM nonutf_test ;
c | REGEXP_REPLACE
---------+----------------
▒Test!▒ | Test!
(1 row)
dbadmin=> SELECT * FROM nonutf_test;
c
---------
▒Test!▒
(1 row)
dbadmin=> UPDATE /*+ DIRECT */ nonutf_test SET c = REGEXP_REPLACE(c, '[^\t\r\n\x20-\x7E]+', '', 1, 0, 'b');
OUTPUT
--------
1
(1 row)
dbadmin=> SELECT * FROM nonutf_test;
c
-------
Test!
(1 row)
Now the new and improved way:
verticademos=> TRUNCATE TABLE nonutf_test;
TRUNCATE TABLE
verticademos=> INSERT INTO nonutf_test SELECT E'\xa0' || 'Test!' || E'\xa0';
OUTPUT
--------
1
(1 row)
verticademos=> SELECT c, makeutf8(c) FROM nonutf_test;
c | makeutf8
---------+----------
▒Test!▒ | Test!
(1 row)
verticademos=> SELECT c, makeutf8(c) "Removed non-UTF8 Junk!" FROM nonutf_test;
c | Removed non-UTF8 Junk!
---------+------------------------
▒Test!▒ | Test!
(1 row)
verticademos=> UPDATE nonutf_test SET c = makeutf8(c);
OUTPUT
--------
1
(1 row)
verticademos=> SELECT * FROM nonutf_test;
c
-------
Test!
(1 row)