Re: Encoding/characterset/font family confusion

Posted by Umberto Salsi on 03/30/07 14:42

Erwin Moller <since_humans_read_this_I_am_spammed_too_much@spamyourself.com> wrote:

> Hi group,
>
> I could use a bit of guidance on the following matter.
>
> I am starting a new project now and must make some decisions regarding
> encoding.
> Environment: PHP4.3, Postgres7.4.3

PostgreSQL is fine, but since you are starting a new project you would be
better off using PHP 5.

> I must be able to receive forminformation and store that in a database and
> later produce it on screen on the client (just plain HTML).
> Nothing special. I do this for many years, but I never paid a lot of
> attention to special characters.
>
> A few day ago I discovered that the euro-sign is not defined in all
> fontfamilies.
> They cannot produce the right sign no matter if I use &euro; or the
> hexadecimal equivalent.
> After a little research I found I could put font-tags around the euro-sign
> with another font-family (Arial in this case) to get the Euro sign.
>
> I am completely graphical impaired, and only understand programmingcode (and
> HTML/JavaScript of course) , so this is a weak point on my side, hence this
> question.
>
> I target on Europe only at the moment (no need for Chineese
> charactersupport)
> That said, will the following setup make sense?
>
> Postgresql db encoding scheme: LATIN1
> In the headers of all my HTML: content-type: text/html charset: iso-8859-1

Latin1 (aka ISO-8859-1) does not include the Euro sign.
ISO-8859-15 is a revision of Latin1 made mainly to add the Euro sign.
Since more and more countries are joining the European Union, Latin1
cannot cover all the writing systems involved (Polish and Turkish people
will have problems entering their names and addresses, for example).

UTF-8 (the most used encoding of the UNICODE charset) would be the best
solution, since it can represent the characters of ALL the charsets
currently used in the world, Euro sign included.
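
To make the difference concrete, here is a small sketch, assuming the
mbstring extension is loaded. It converts the Euro sign from UTF-8 to
ISO-8859-15 and to ISO-8859-1; the variable names are only illustrative.

```php
<?php
// The Euro sign (U+20AC) written as its three UTF-8 bytes.
$euro_utf8 = "\xE2\x82\xAC";

// ISO-8859-15 has a slot for it: the single byte 0xA4.
$euro_latin9 = mb_convert_encoding($euro_utf8, 'ISO-8859-15', 'UTF-8');

// ISO-8859-1 has no Euro sign, so the conversion cannot keep it;
// mbstring emits its substitute character instead.
$euro_latin1 = mb_convert_encoding($euro_utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($euro_utf8), "\n";    // e282ac
echo bin2hex($euro_latin9), "\n";  // a4
echo bin2hex($euro_latin1), "\n";  // whatever the substitute character is
```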

> A few related questions:
> 1) Will people be able to copy/paste info from other sources (like
> wordprocessing programs and other websites) into my forms?

Browsers all work internally in UNICODE: a page is converted from its
encoding (ISO-8859-1 in your case) to UNICODE once received, and the data
provided by the user are converted back to the encoding of the page
(ISO-8859-1) before being sent to the server; characters that do not fit
that encoding are typically sent as &#NNN; where NNN is the decimal
UNICODE code point of the character. So UTF-8 (the recommended interchange
encoding for the UNICODE charset) is definitely the best choice.
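
As an illustration of that fallback, the sketch below decodes such a
numeric reference on the server side. The sample value is made up, and
the exact form a given browser sends is an assumption here.

```php
<?php
// What a browser may submit from a Latin1 page when the user typed
// "5 " followed by a Euro sign: the sign does not fit the page
// encoding, so it arrives as a numeric character reference.
$received = "5 &#8364;";

// Decode the reference into real UTF-8 bytes.
$decoded = html_entity_decode($received, ENT_QUOTES, 'UTF-8');

echo bin2hex($decoded), "\n"; // 3520e282ac
```

Note that the server cannot tell this apart from a user who literally
typed the eight characters "&#8364;", which is one more reason to serve
the form as UTF-8 in the first place.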

> 2) Can I use regular expressions as I am used to (ASCII) in my PHP code?
> Will I match e acute, eurosign, etc?

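The preg_*() functions support the /u modifier for UTF-8 strings. It is
required when the pattern contains non-ASCII chars, and also when "." or
a character class must match a whole character rather than a single byte.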
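
A short sketch of what /u changes, using "e acute" as the subject (the
variable names are only illustrative):

```php
<?php
$e_acute = "\xC3\xA9"; // e acute in UTF-8: one character, two bytes

// Without /u the subject is matched byte by byte, so "." sees two items.
$bytes = preg_match('/^.$/', $e_acute);   // 0

// With /u both the pattern and the subject are treated as UTF-8.
$chars = preg_match('/^.$/u', $e_acute);  // 1

// With /u an invalid UTF-8 subject makes the call return false.
$bad = preg_match('/./u', "\xFF");        // false
```

Because of the last case, compare the result with === rather than
treating 0 and false alike.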

> 3) Will the roundtrip describe here under have problems with normal expected
> european characters?
>
> client copies some text from some source ->

Good programs and OS should copy the text as UNICODE chars.

> paste in the form ->

Since browsers already use UNICODE internally, no conversion takes place here.

> receive by PHP ->

The browser converts the text into the encoding of the page containing the
FORM. PHP handles every string as a sequence of bytes, whatever its charset
or encoding may be.

Every string must be validated:

$s = (string) $_POST['address'];
# Remove ASCII control chars 0-31 and 127:
$s = preg_replace("/[\\000-\\037\\177]/", "", $s);
# Ensure the UTF-8 encoding; bad sequences are replaced by the
# mbstring substitute character ("?" by default):
$s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
# Ensure the max length (50 chars); mb_substr() counts characters,
# whereas mb_strcut() counts bytes:
if( mb_strlen($s, 'UTF-8') > 50 )
    $s = mb_substr($s, 0, 50, 'UTF-8');
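
The same checks can be wrapped in a reusable function. This is only a
sketch: clean_field() and its parameters are hypothetical names, and it
truncates with mb_substr() so that the limit is counted in characters
rather than bytes.

```php
<?php
// Hypothetical helper: returns $raw as a UTF-8 string without ASCII
// control chars and at most $maxChars characters long.
function clean_field($raw, $maxChars)
{
    $s = (string) $raw;
    // Remove ASCII control chars 0-31 and 127 (tab and newline included).
    $s = preg_replace('/[\x00-\x1F\x7F]/', '', $s);
    // Force valid UTF-8; invalid sequences are substituted by mbstring.
    $s = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    // Limit the length in characters, not in bytes.
    if (mb_strlen($s, 'UTF-8') > $maxChars) {
        $s = mb_substr($s, 0, $maxChars, 'UTF-8');
    }
    return $s;
}
```

It would be called as clean_field($_POST['address'], 50) for every
field of the form, with the limit of that field.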

> insert in Postgresql (or update) ->

PostgreSQL requires you to declare the encoding when the DB is created.
For example

CREATE DATABASE mydb WITH TEMPLATE = template0 ENCODING = 'UNICODE';

will create a new DB where all the text fields are UTF-8 (PostgreSQL 8.1
and later call this encoding UTF8 and keep UNICODE as an alias), so a
VARCHAR(50) might actually occupy up to 50*4 bytes: UTF-8 as defined by
RFC 3629 uses at most 4 bytes per character, although the original design
allowed up to 6. The non-standard PostgreSQL type TEXT is often more
convenient, since the manual states that VARCHAR and TEXT are treated
internally in exactly the same way; enforcing the max length of every
field can be left to the web interface implemented in PHP, as in the
example above.
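
The gap between characters and bytes is easy to check from PHP. The
sketch below builds two 50-character strings, one from a 3-byte
character and one from a 4-byte character (the longest that current
UTF-8 allows), and compares their sizes; the sample characters are
arbitrary.

```php
<?php
$euro = "\xE2\x82\xAC";       // U+20AC, 3 bytes in UTF-8
$clef = "\xF0\x9D\x84\x9E";   // U+1D11E, 4 bytes in UTF-8

$fifty_euros = str_repeat($euro, 50);
$fifty_clefs = str_repeat($clef, 50);

// 50 chars, 150 bytes
echo mb_strlen($fifty_euros, 'UTF-8'), " chars, ", strlen($fifty_euros), " bytes\n";
// 50 chars, 200 bytes
echo mb_strlen($fifty_clefs, 'UTF-8'), " chars, ", strlen($fifty_clefs), " bytes\n";
```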

$db = pg_connect("dbname=mydb");

# The strings we are sending to the DB server are encoded
# as UTF-8; since the DB we created already uses UTF-8, no
# conversion takes place between PHP and the DB:
pg_set_client_encoding($db, "UTF-8");

pg_query($db, "INSERT INTO sometable (aString) VALUES ("
    . "'" . pg_escape_string($s) . "')");
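
With PHP 5.1 or later, an alternative is to let the server bind the
value through pg_query_params(), which avoids building the quoted
literal by hand. A minimal sketch, where insert_string() is a
hypothetical helper and the table is the one used above:

```php
<?php
// Hypothetical helper: inserts one string into sometable.aString.
// $db is a connection returned by pg_connect().
function insert_string($db, $s)
{
    // $1 is bound by the server to the first element of the array,
    // so no manual quoting or escaping is needed.
    return pg_query_params(
        $db,
        'INSERT INTO sometable (aString) VALUES ($1)',
        array($s)
    );
}
```

The SQL text is in single quotes so that PHP does not try to
interpolate $1 as a variable.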

> retrieve from postgresql ->

$db = pg_connect("dbname=mydb");
pg_set_client_encoding($db, "UTF-8");
$table = pg_query($db, "SELECT * FROM sometable");

The strings fetched from the result are UTF-8 encoded.

> display as HTML (with content-type: text/html charset: iso-8859-1)

(there is a missing ";" before "charset")

If the encoding of the DB matches that of the page, no conversion is required.

If the string appears as HTML text, apply htmlspecialchars():

echo "Your address is: " . htmlspecialchars($s);

If the string must be inserted inside an attribute, enclose the value in
double quotes and apply htmlspecialchars():

echo "<input type=text value=\"" . htmlspecialchars($s) . "\" name=xxx>";
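
For UTF-8 pages it may be safer to name the encoding explicitly instead
of relying on the default of htmlspecialchars(). A sketch, where h() is
a hypothetical shorthand and the sample address is made up:

```php
<?php
// Hypothetical shorthand for escaping a UTF-8 string for HTML output.
function h($s)
{
    // ENT_QUOTES escapes both double and single quotes.
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

$s = "Via <Roma> 5, \"int. 2\" \xE2\x82\xAC";

echo "Your address is: ", h($s), "\n";
echo "<input type=text value=\"", h($s), "\" name=xxx>\n";
```

When told the string is UTF-8, htmlspecialchars() returns an empty
string for input that is not valid UTF-8 (unless ENT_IGNORE or
ENT_SUBSTITUTE is set), so validating the encoding on input matters.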

If the string must be inserted inside a <textarea>, apply htmlspecialchars()
only; do not add nl2br(), because the <br /> tags it inserts would show up
as literal text in the textarea:

echo
"Your new address: <textarea name=newaddress>\n", # required \n
htmlspecialchars($s),
"</textarea>";

> Is that OK?
> Any pitfalls?

Don't try to dereference single chars of a UTF-8 string: $s[0] returns a
byte, not a character.
Don't use the str*() functions on text that may contain non-ASCII chars;
use their mb_str*() counterparts.
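
A short sketch of what goes wrong, assuming the mbstring extension is
loaded (the sample word is arbitrary):

```php
<?php
$s = "\xC3\xA9t\xC3\xA9"; // French "ete" with two e acute: 3 chars, 5 bytes

$first_byte = $s[0];                         // "\xC3", half a character
$first_char = mb_substr($s, 0, 1, 'UTF-8');  // the whole e acute

$byte_len = strlen($s);                      // 5
$char_len = mb_strlen($s, 'UTF-8');          // 3

// Case mapping that knows about non-ASCII letters.
$upper = mb_strtoupper($s, 'UTF-8');         // "\xC3\x89T\xC3\x89"
```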

Every static HTML page must be UTF-8 encoded and must contain
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Every PHP page must be UTF-8 encoded and must contain
header("Content-Type: text/html; charset=UTF-8");

> Should I maybe use UTF-8?

Definitely.

Regards,
___
/_|_\ Umberto Salsi
\/_\/ www.icosaedro.it

 
