Reply to MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode — PHP Programming Language

Posted by jrs_14618@yahoo.com on 06/13/06 16:33

Hello All,

This post is essentially a reply a previous post/thread
here on this mailing.database.myodbc group titled:

MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

[This version has a couple subtle edits from the orginial I posted
on mailing.database.myodbc - I'm cross posting here on this
topic/subject related newsgroup]

I was wondering if anybody has experienced the same issues
challenges I'm experiencing I'll describe shortly. Once
resolved some fascinating and powerful multi-lingual
apps incorporating non-English/latin character sets can be
realized by many developers.

I have a Unicode utf8 English - Arabic - Hebrew - Greek (and
several other languages) database in Microsoft Excel. I KNOW
that it is Unicode utf8 data because MySQL tells me it
recognizes the encoding as such but not in the context I want.

Allow me to explain ...

I can search the Unicode utf8 encoding with no problem in
Excel. While in Excel I highlight a complete word or a
partial string of an Arabic word copy it to the clipboard
(i.e. memory). I then do a find and the process is the
same successful result as if it was an English string.

MySQL 5.0 is supposed to handle Unicode utf8

I created a MySQL database I named: languages

CREATE DATABASE languages ;

and I implemented the following command on a MySQL
command prompt:

ALTER DATABASE languages DEFAULT CHARACTER SET utf8;

No problem (so far) MySQL seemingly recognized utf8 and
accepted it. My understanding is with the ALTER command
the tables I create against languages will be utf8.

I now created a table I named mainlang which denotes it
will be the main table for my languages.

mysql>CREATE TABLE mainlang
->(
->langNumID varchar(30),
->colB varchar(30),
->colC varchar(30),
->primary key (langNumID, colB)
->);

Again so far no problem: Table successfully created.
My third column 'colC' is where the Unicode data
will be stored.

I now attempt to import the database from my
Excel file into my MySQL database as follows:

mysql>load data infile 'c:\\arabicdictionary.csv'
->into table mainlang
->fields terminated by ','
->lines terminated by '\n'
->(langNumID, colB, colC);
ERROR 1406 (22001): Data too long for 'colC' at row 1

So what to do? I did a search and found other
people seemingly had the same problem and someone
suggested:

ALTER DATABASE languages DEFAULT CHARACTER SET cp1250;

I dropped mainlang, recreated it, redid the load and
Lo and behold ... it seemed to work. No Data too long
error occurred and when I did the following query:

mysql>select langNumID, colB, colC
->from mainlang
->where colB = '4994';

I see colA have a correct numeric value, colB a
correct numeric value (4994) and for colC a string of
unintelligible characters with diacritical marks,
oomlats etc. which I know is the cp1250 encoding
interpretation of the Unicode utf8 data which is
similarly unintelligible in its own regard.

Now what I try is: do a copy of the obscure colC
cp1250 character string into the clipboard/memory
and then do the following tweak on the original
select statement to see if I can search on the
(now) cp1250 character string:

mysql>select langNumID, colB, colC
->from mainlang
->where colc = 'paste of the cp1250 character string';

The computer would not allow a paste unless I pressed
the escape key. On initiating this select command
I got an empty set (no match)

My questions are:

Has anyone been successful creating a Unicode utf8
MySQL database that accepts Arabic?

If yes, how did you get around or not encounter the
Data too long issue?

Have you tried the cp1250 (or cp1251 - same mechanics
same results) work around as I have? Are you
able to search the cp1250 character string (my colC)?
If yes, how did you successfully manage to do it?

Lastly, if I take the cp1250 encoded string and paste
it into Excel ... I can string search the cp1250
encoding with no problem.

Also, here's how I know my Unicode utf-8 data is
correct apart from my own manual cross-referencing
and being recognized by MySQL in some respect:

When I copy the Unicode utf8 encoding and try to
paste it into the select command to see what would
happen I get the following error:

ERROR 1257 (HY000): Illegal mix of collations
(cp1250_general_ci, IMPLICIT) and
(utf8_general_ci, COERCIBLE) for operation '='

So what I have here is a situation where MySQL
is recognizing Unicode utf8 encoding but not
from the respect of packing a table!

Go Figure ...

Anyone wishing to share any insight or solution would
be GREATLY appeciated. I promise if I find a solution
I will share it.

Thank you Very Much, Shukran Jiddan, Todah Rabah,
Muchos Gracias ...

Joel S
(585) 255-0997
jrs_14618 at yahoo.com

[Back to original message]

Удаленная работа для программистов • Как заработать на Google AdSense • England, UK • статьи на английском • PHP MySQL CMS Apache Oscommerce • Online Business Knowledge Base • DVD MP3 AVI MP4 players codecs conversion help

Home • Search • Site Map • Set as Homepage • Add to Favourites

Сайт изготовлен в Студии Валентина Петручека —
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация