The issue of character sets and how this is handled by CSP has come up a few
times and so I decided to describe the logic behind the CSP character set choices,
what the options are and how it works. Hopefully this will be useful for you.
Each CSP page served Caché will include a set of HTTP headers that informs
the browser about the contents of the page. The definition of these headers is contained
in RFC 1945 (for the HTTP/1.0 spec which we currently support in CSP). These headers
are not displayed in the browser and most browsers in fact do not give you a way of
showing them at all. A good way of finding out what the headers are is to use the
built in
%Net.HttpRequest object in Caché. For example
to find the headers from a typical CSP page:
Set http = ##class(%Net.HttpRequest).%New()
Set http.Server = "127.0.0.1"
Set http.Port = 1972
Do http.Head("/csp/samples/loop.csp")
Do http.HttpResponse.OutputToDevice()
Do http.%Close()
And it gives the following output (on a Unicode Caché system):
HTTP/1.0 200 OK
CACHE-CONTROL: no-cache
CONNECTION: Close
CONTENT-TYPE: text/html; charset=utf-8
DATE: Mon, 13 Aug 2001 17:34:18 GMT
EXPIRES: Thu, 29 Oct 1998 17:04:19 GMT
PRAGMA: no-cache
SET-COOKIE: CSPSESSIONID=260066677206262462181; path=/;
Note that the Port is set to 1972 to use the built in mini web server in Caché,
although if you have a proper web server setup this can use the default port number
of 80. Another way of displaying the headers a CSP will produce is from Caché
using:
Do ShowPage^%apiCSP("/csp/samples/loop.csp",,,1)
The 'Content-Type' header is always sent and this specifies the type of document
the browser should expect. Normally this is
text/html meaning that
this is an HTML document so it should render it as such, but if it was a JPEG image
then it will be
image/jpeg. If the content type is
text then
you can also specify a character set to use on the same line with the
charset modifier,
for example:
Content-Type: text/html; charset=utf-8
This means that the web page that follows is encoded using utf-8, which is a
mapping of the unicode standard so that ASCII codes 0-127 are mapped to utf-8 as values
0-127. The browser will lookup each character in the charset table specified in the
HTTP header and draw the symbol corresponding to this character in the window.
In CSP it is important to send the correct charset header with your CSP page
so that the browser can display it correctly. If you do not specify a value for the
charset (I will cover how to set this later) then CSP will use the default value,
on a Unicode Caché system this is
utf-8 and on an 8-bit Caché
system this is the system default locale which is set using the
CNLS.EXE utility
(installed in the
CacheSys/Bin directory along with the other
executable files) and can be seen when you run this utility at the top of the
locale tab.
It can also be found by invoking:
Write ^%SYS("LOCALE","CURRENT")
from the command line. To translate the name next to the drop-down dialog into
the external name used in the HTTP header use the function:
Write $$MapExtCharset^%NLS(^%nls("Loc",locale,0))
If you wish to explicitly set the charset for a page rather than using the default
there are several ways of doing this:
Set %response.CharSet="iso-8859-1"
then if the content-type is a
text type (it does not happen for
other content types) it will output the
charset header with this value.
This method is generally not recommended as the input can not be translated, it is
suggested that you use one of the other mechanisms. Use the <csp:content charset="iso-8859-1">
tag in the CSP page. This is matched by a CSP system rule and sets the
CHARSET parameter
of the class that is compiled from this csp page. Then when the page is rendered the
%response.CharSet value
is initialized to the
CHARSET parameter of this page and so this
is the value written out for the charset if the content-type is a 'text' type. Use
the <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> tag
in the <head></head> section of the CSP page. This is picked up by a system
rule and gets converted into the
CHARSET parameter in the class
that is compiled so the end result is exactly the same as option 1.
Create a class that derives from
%CSP.Page and in this class
specify the charset that you wish to use for the parameter
CHARSET.
In the CSP application settings tab in the Caché Configuration Manager set
the superclass for this application to be the new class you have created. Then recompile
all the pages in your CSP application and they will now inherit from this superclass
and so they will automatically pick up this default
CHARSET parameter.
This is a useful way to set the charset for a large number of pages at the same time.
Besides putting the
charset entry in the HTTP header setting
this also effects how Caché translates any characters it writes out to the
browser and how it interprets input it receives from the browser, either as parameters
in the url or from an HTTP post. CSP will always output the HTTP headers using utf-8
because before the browser reads the 'charset' header there is no way it can know
what the charset should be and if the RFC ever allows characters other than ASCII
in the HTTP header the only logical choice is utf-8 as the ASCII characters are all
preserved using this encoding. Once the headers are output CSP then switches to output
the rest of the page in the charset specified in the header using the call:
Write $$SetIO^%NLS(tablename)
This function returns the old table that was in use and changes the current
device to use the new table. This tablename is the Caché internal name of the
table, for example to turn translation off it is
RAW and to use utf-8
it is
UTF8. If you want to find the name of the current table use the
following:
On a Unicode version of Caché if it is using the default charset then
this will force all characters output to be translated to their utf-8 representation.
For an 8-bit Caché if it is using the default which is the locale specified
in Caché then it will not translate the characters at all and so will use the
RAW table.
This is because the characters in the Caché database are already in the default
locale (because that is the meaning of the default locale) and so it would just be
translating from the default table to the default table and so no translation is needed.
If you specify a charset to use then Caché will pick the NLS table so
that the characters stored inside the Caché database will be converted to characters
in this specified charset. To find the name of the internal Caché table from
the charset name use the function:
Write $$MapCharset^%NLS(charset)
If when displaying a page from CSP you get a
Character Set TEST not installed error
message then it means that you are requesting a charset that Caché does have
available. For example if you invoke:
Write $$MapCharset^%NLS("TEST")
then it returns "" because there is no such charset as
TEST.
When you get this error you have two choices, either load the table into Caché
using the
CNLS.EXE utility (or create a table with this name if
we do not supply this table) or change the CSP page to use another charset that is
supported.
Any characters that can not be converted using the specified table are output
as
? or whatever error character is specified in the
CNLS.EXE utility.
So if you start seeing
? appearing in either your web pages or in the
data posted back to Caché then it is likely that there are characters that
were not translated correctly.
When we receive data back from the web browser in the url or in an HTTP post
CSP will convert the characters from the external charset to the Caché default
locale. As the HTTP headers sent by the web browser do not specify which charset the
browser is using (a fairly serious oversight in the HTTP/1.0 and 1.1 specs) we are
forced to make an estimated guess as to which charset to use, that in practice appears
to work well. We assume that the charset sent by the browser is the same as the charset
specified on the page that it is submitting the information to. So if I have a
form.csp page
that uses charset 'iso-8859-1' and it has a form that submits the information to
result.csp which
also uses charset
iso-8859-1 then we will convert the information sent
by the browser from the charset of the
result.csp page, i.e.
iso-8859-1,
into the Caché default locale. From testing the major browsers it seams that
if you supply a page with a particular charset then any information submitted by the
browser from this page is encoded using this same charset. This means that you should
make sure that both the page that is going to submit information to Caché and
the page that receives this information must be using the same charset, however as
entire applications tend to use the same charset this is not a difficult requirement.
CSP converts strings from the external to the internal charset using the
$ZConvert function:
Write $ZConvert(string,"I",table)
Where
string is the string to convert and
table is
the Caché internal table name.
The one last piece to this puzzle is how CSP reads in the csp files from the
filesystem and how this deals with this charset issue. The compiler reads csp files
using the
%FileCharacterStream object. It first checks if the
file is a Unicode file and if so whether it is big or little Endean. This done by
looking at the leading two bytes in the file, if they are $char(255,254) or $char(254,255)
then this is a Unicode file and it reads it in as such. If the file is not a Unicode
file then it uses the default translation setup in
CNLS.EXE for
reading and writing to files from Caché (on the
Locale tab under
the translation type for sequential devices).
So for example assume you have a Unicode Caché system and wish to use
this in Japan where the charset will be Shift-JIS. When the CSP compiler loads the
csp file it will use the default sequential device translation which will be setup
to convert from Shift-JIS to Unicode and so the CSP class compiled in Caché
will be stored internally in Unicode. Using one of the mechanisms described above
you tell CSP to output this page with Shift-JIS charset, so when a request comes in
for this page CSP sets the translation on the TCP/IP device to convert from the Unicode
internal representation into Shift-JIS externally and it will also add the HTTP header
'charset=Shift-JIS' so the browser will interpret the page correctly. If this page
contains a form where some Shift-JIS characters are entered and then submitted the
data sent by the browser will be using the Shift-JIS charset. As the page it is being
submitted to also has a charset of Shift-JIS then CSP will convert from Shift-JIS
back to the internal Unicode representation of this character. Note that there is
a slight logical inconsistency in the <meta http-equiv> method of setting the charset
in a csp file. If you create a html file with a <meta http-equiv> tag in it on
a web server then when this page is output to a browser the browser will see this
<meta http-equiv> tag and interpret the data in the page as being from the charset
specified in this tag, just as if the 'Content-Type' HTTP header had included the
'charset'. So it tells the browser the charset of the html file. However if you put
a <meta http-equiv> tag in a csp file then this will be loaded into Caché
using the default device translation for a sequential file and then when it is output
to the browser it will convert from the Caché default locale to the charset
specified in the <meta http-equiv> tag. So an HTML editing tool that saves a page
with a <meta http-equiv> specifying a charset of 'utf-8' will save the whole page
as utf-8. When Caché loads this page it should translate from utf-8 to the
Caché default internal locale however it just uses the default device translation
instead which will cause problems if the device translation does not match the charset
of the page. Often the charset specified in the <meta http-equiv> is the same as
the default device translation (e.g. both are Shift-JIS on a Japanese system) and
so this will work correctly.
If you wish to get around the translations of the page on output to the browser
and on input from the browser we also provide a parameter of the compiled class called
NOCHARSETCONVERT if
this is set to 1 then CSP will still output the charset in the HTTP
Content-Type header
but it will not set the device translation to convert from the Caché default
locale to this charset and it will not convert any values sent by the browser from
this charset to the Caché default locale. This means that all charset conversion
issues must be handled by your application rather than being done by the CSP engine
itself. You can also set the NOCHARSETCONVERT option in a csp page with the tag <csp:content
nocharsetconvert="1">.
NOCHARSETCONVERT was developed for situations such as Thailand
that have no locale defined and therefore use English default locale but actually
store 8-bit Thai characters in the database.