Re: text parsing

Posted by Carolyn Marenger on 01/24/08 10:51

McKirahan wrote:
> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
> news:5a282$4797516b$cf70133e$25433@PRIMUS.CA...
>> McKirahan wrote:
>>> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
>>> news:81d7b$47973b05$cf70133e$360@PRIMUS.CA...
>>>> McKirahan wrote:
>>>>> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
>>>>> news:7c0f$4795ea54$cf70133e$1079@PRIMUS.CA...
>>>>>> McKirahan wrote:
>>>>>>> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
>>>>>>> news:74fb1$479501d1$cf70133e$7458@PRIMUS.CA...
>>>>>>>> Can someone point me in the direction of some good documentation on
>>>>>>>> text parsing?
>>>>>>>>
>>>>>>>> I want to take a bunch of text files (rtf), read them in and dump the
>>>>>>>> contents in a database. The files are effectively a flat file database,
>>>>>>>> with I suspect some fairly intricate programming needed to process the
>>>>>>>> files. Unfortunately, they are laid out for human readability, not data
>>>>>>>> conversion.
>>>>>>> A few questions.
>>>>>>>
>>>>>>> How many is a "bunch"?
>>>>>>> What would the target database be -- MySQL?
>>>>>>> What table and column structures do you envision?
>>>>>>> Perhaps simply a single table with two columns:
>>>>>>> filename (key) and a memo field containing the data?
>>>>>>> What is the purpose behind doing this?
>>>>>>>
>>>>>> A few answers
>>>>>>
>>>>>> A bunch is about a dozen. Basically one large file that was broken into
>>>>>> sixteen subsets, following the initial letter for each record.
>>>>>>
>>>>>> The target database would be MySQL
>>>>>>
>>>>>> I haven't looked too closely at the data, but I think one main table
>>>>>> with a few linked tables for those cases where there may be more than
>>>>>> one piece of data for a category. There are about 25 categories to each
>>>>>> record. Eventually there would be additional structure added around the
>>>>>> imported data, but that isn't relevant to importing the data itself. (I
>>>>>> will confirm this before beginning to code.)
>>>>>>
>>>>>> The purpose: I am a D&D fan and I run games. I would like to be able to
>>>>>> reference the material and automate much of the process so I don't have
>>>>>> to lug and reference 20lbs of books.
>>>>> Any chance the RTF files are online so I could look at them?
>>>>>
>>>>> Perhaps http://www.wizards.com/default.asp?x=d20/article/srd35?
>>>>> http://www.wizards.com/d20/files/v35/SRD.zip contains 88 RTF files.
>>>>>
>>>>>
>>>>> Also, I gather, this might be a one-time effort; correct?
>>>>>
>>>>> Not what you requested but ...
>>>>>
>>>>> I've developed a VBScript solution that takes the following approach:
>>>>> for a given folder, each RTF file is opened in MS-Word and saved
>>>>> as a text file which is opened and read then saved in an MS-Access
>>>>> database table containing 3 columns: id (AutoNumber), file, data.
>>>>>
>>>>> Using those 86 RTF files it created a 10MB MS-Access database.
>>>>>
>>>> Yes, they are online. Yes, you can look at them. Yes, those are the
>>>> files except I only care about the 16 monster files. Yes, this is a
>>>> one-time effort.
>>>>
>>>> My goal is to create an encounter generation program - where I key in
>>>> climate, geography, season, encounter level, time of day, proximity to
>>>> civilization, and the application gives me a suggested random encounter
>>>> suited to the scenario. For example, if the party was wandering around
>>>> the city sewers on a hot summer night, they might encounter a pack of
>>>> giant rats being led by a wererat. I would then want the program to
>>>> determine how many rats, how many hit points each, and any other
>>>> pertinent variable data, including what weapons and treasure the wererat
>>>> was carrying and using.
>>>>
>>>> Having the rtfs loaded into a database, as your script does, would
>>>> enable faster searches, but it would not go the next step and perform the
>>>> various calculations based on the results of the searches. It is a good
>>>> start, but if it has stripped any of the rtf encoding, it may make it
>>>> harder to have a script find the various 'fields'.
>>>>
>>>> Thanks, Carolyn
>>> I counted 17 "Monster" prefixed files.
>>>
>>> My version creates ".txt" files which do strip "the rtf encoding".
>>>
>>> An alternative version creates ".htm" files which retain the
>>> formatting you want; I don't think you really want all of the
>>> "rtf encoding" unless you fully understand the specification:
>>> (search on "rtf specification".)
>>>
>>> Perhaps, as an intermediate step, you would like all of the
>>> "Monster" rtfs converted to HTML and made available via
>>> an interface to open one or more for viewing.
>>>
>>> As HTML files they consume 7.5MB.
>>>
>> There are a couple of the monster prefixed files that are not listings
>> of monsters but other information, such as monsters as characters.
>> Anyway, exact number of files is not overly important.
>>
>> I just did a little test, and looking at the files, I think the easiest
>> to work with may indeed be the text file.
>>
>> Here is an example to illustrate: I am pulling the monster name, type
>> and hit dice from each file format.
>>
>> in rtf...
>> {
>> \par }{\fs36
>> \par DELVER
>> \par }\trowd \trgaph108\trleft-108\trbrdrh\brdrs\brdrw10
>> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
>> \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1
>> \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10
>> \cltxlrtb\clftsWidth3\clwWidth4871
>> \cellx6840\pard \ql \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20
>> }{\b\fs19 \cell }{\fs20 Huge Aberration}{\fs19 \cell }\pard \ql
>> \li0\ri0\widctlpar\intbl\aspalpha\aspnum\faauto\adjustright\rin0\lin0
>> {\fs19 \trowd \trgaph108\trleft-108\trbrdrh
>> \brdrs\brdrw10
>> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
>> \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1
>> \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10
>> \cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\row }\trowd
>> \trgaph108\trleft-108\trbrdrh\brdrs\brdrw10
>> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
>> \clvertalt\clbrdrt\brdrs\brdrw10 \clbrdrb\brdrs\brdrw10
>> \cltxlrtb\clftsWidth1 \cellx1969\clvertalt\clbrdrt\brdrs\brdrw10
>> \clbrdrb\brdrs\brdrw10
>> \cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\pard \ql
>> \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20 Hit Dice:}{\b\fs19
>> \cell }{\fs20 15d8+78 (145 hp)}{\fs19 \cell }\pard \ql
>>
>> ----------
>> in .html...
>> <P STYLE="page-break-after: avoid"><FONT SIZE=5>DARKMANTLE</FONT></P>
>> <TABLE WIDTH=410 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=7
>> CELLSPACING=0 FRAME=VOID RULES=ROWS>
>> <COL WIDTH=124>
>> <COL WIDTH=258>
>> <TR VALIGN=TOP>
>>
>> <TD WIDTH=124>
>> <P CLASS="western">
>> </P>
>> </TD>
>> <TD WIDTH=258>
>> <P CLASS="western"><FONT SIZE=2>Small Magical Beast</FONT></P>
>> </TD>
>> </TR>
>> <TR VALIGN=TOP>
>>
>> <TD WIDTH=124>
>> <P CLASS="western"><FONT SIZE=2><B>Hit Dice:</B></FONT></P>
>> </TD>
>> <TD WIDTH=258>
>> <P CLASS="western"><FONT SIZE=2>1d10+1 (6 hp)</FONT></P>
>> </TD>
>> </TR>
>>
>> ---------
>> in .txt...
>>
>> DARKMANTLE
>>
>> Small Magical Beast
>> Hit Dice:
>> 1d10+1 (6 hp)
>>
>>
>> --------
>>
>> So, looking at that and assuming the rest will be similar, the text
>> version looks the easiest to deal with. If document styling such as
>> 'title', 'heading' and 'subheading' had been used, maybe not, but in
>> this case, a new line seems to denote either a field heading or field
>> data. There are exceptions of course - particularly when denoting a
>> category of monster.
>>
>> That does bring me a little closer to achieving my goal. Thanks for
>> the assistance. :)
>>
>> Carolyn
>
> So I gather you have what you need.
>
> I'd suggest just manually converting the files (via MS-Word Save-As)
> rather than automating that part of the process since it's a one-time
> effort and there aren't that many files.
>
> Below is a page that will list and allow selection of the "Monster"
> files via a dropdown with the page displayed in an <iframe>. The
> <select> is on the right to allow quicker access to the scroll bar.
> Save it as "Monster.htm" and put it in the same folder as the
> "Monster" files as Web pages; (i.e. with a ".htm" extension).
> Doubleclick on the filename in Windows Explorer or create a
> desktop shortcut to it for quicker access.
>
> Watch for word-wrap.
>
> <html>
> <head>
> <title>Monster.htm</title>
> <script type="text/javascript">
> function monster(that) {
> var what = document.getElementById("id_select").value;
> document.getElementById("id_picked").innerHTML = what;
> document.getElementById("id_iframe").src = what;
> }
> </script>
> <style type="text/css">
> .font { font-family:Arial; font-size:8pt }
> .zero { margin:0px; padding:0px }
> </style>
> </head>
> <body class="zero">
> <form action="" method="get" class="zero">
> <table align="center" border="0" cellpadding="0" cellspacing="0"
> width="100">
> <tr valign="top">
> <th>
> <span id="id_picked" class="font"></span><br>
> <iframe id="id_iframe" width="860" height="600"></iframe>
> </th>
> <td>&nbsp;</td>
> <td class="font">
> &nbsp; &nbsp; &nbsp; <b>Monster Files:</b><br>
> <select class="font" size="19" id="id_select" onchange="monster(this)">
> <option value="">
> <option value="MonstersIntro-A.htm">Monsters Intro-A
> <option value="MonstersB-C.htm">Monsters B-C
> <option value="MonstersD-De.htm">Monsters D-De
> <option value="MonstersDi-Do.htm">Monsters Di-Do
> <option value="MonstersDr-Dw.htm">Monsters Dr-Dw
> <option value="MonstersE-F.htm">Monsters E-F
> <option value="MonstersG.htm">Monsters G
> <option value="MonstersH-I.htm">Monsters H-I
> <option value="MonstersK-L.htm">Monsters K-L
> <option value="MonstersM-N.htm">Monsters M-N
> <option value="MonstersO-R.htm">Monsters O-R
> <option value="MonstersS.htm">Monsters S
> <option value="MonstersT-Z.htm">Monsters T-Z
> <option value=""> - - - - - - - - - - - - -
> <option value="MonsterFeats.htm">Monster Feats
> <option value="MonstersAnimals.htm">Monsters Animals
> <option value="MonstersasRaces.htm">Monsters as Races
> <option value="MonstersVermin.htm">Monsters Vermin
> </select>
> </td>
> </tr>
> </table>
> </form>
> </body>
> </html>
>

I was going to do the conversion manually, with OpenOffice. Using Word
would cost too much, as I would have to go and purchase it. I do have a
Windows box to install it on (for games and website testing), but other
than that it's Linux and OpenOffice. The web page you just gave me works
fine either way. Thanks!
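
For the parsing step itself, here is a rough sketch of how the .txt layout
quoted above might be read. It is written in Python only to keep it short
(the same regular expression should carry over to PHP's preg_match). The
function names, the dictionary keys and the hit-dice pattern are all
assumptions for illustration, based solely on the DARKMANTLE and DELVER
excerpts; it assumes the first two non-blank lines are the name and type,
and that every label line ending in ":" is followed by its value line.

```python
import random
import re

# Sample copied from the .txt excerpt quoted above (DARKMANTLE entry).
SAMPLE = """
DARKMANTLE

Small Magical Beast
Hit Dice:
1d10+1 (6 hp)
"""

# Assumed shape of a hit-dice value, e.g. "1d10+1 (6 hp)" or "15d8+78 (145 hp)".
HIT_DICE = re.compile(r"(\d+)d(\d+)([+-]\d+)?\s*\((\d+) hp\)")


def parse_record(text):
    """Split one record into name, type and labelled fields (hypothetical layout)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = {"name": lines[0], "type": lines[1]}
    i = 2
    while i + 1 < len(lines):
        if lines[i].endswith(":"):
            # A label line such as "Hit Dice:" followed by its value line.
            record[lines[i][:-1]] = lines[i + 1]
            i += 2
        else:
            i += 1  # Skip anything that does not look like a label.
    return record


def parse_hit_dice(value):
    """Break a hit-dice string into numbers; None if it does not fit the pattern."""
    m = HIT_DICE.fullmatch(value)
    if m is None:
        return None
    count, sides, bonus, average = m.groups()
    return {
        "count": int(count),
        "sides": int(sides),
        "bonus": int(bonus or 0),
        "average_hp": int(average),
    }


def roll_hit_points(dice, rng=random):
    """Roll the dice once, e.g. for one creature in a generated encounter."""
    rolls = sum(rng.randint(1, dice["sides"]) for _ in range(dice["count"]))
    return rolls + dice["bonus"]


record = parse_record(SAMPLE)
dice = parse_hit_dice(record["Hit Dice"])
print(record["name"], record["type"], dice, roll_hit_points(dice))
```

Entries whose hit dice do not fit that simple pattern come back as None, so
they could be listed and fixed by hand rather than guessed at.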

Carolyn

 
