Re: text parsing

Posted by McKirahan on 01/23/08 15:08

"Carolyn Marenger" <cajunk@marenger.com> wrote in message
news:5a282$4797516b$cf70133e$25433@PRIMUS.CA...
> McKirahan wrote:
> > "Carolyn Marenger" <cajunk@marenger.com> wrote in message
> > news:81d7b$47973b05$cf70133e$360@PRIMUS.CA...
> >> McKirahan wrote:
> >>> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
> >>> news:7c0f$4795ea54$cf70133e$1079@PRIMUS.CA...
> >>>> McKirahan wrote:
> >>>>> "Carolyn Marenger" <cajunk@marenger.com> wrote in message
> >>>>> news:74fb1$479501d1$cf70133e$7458@PRIMUS.CA...
> >>>>>> Can someone point me in the direction of some good documentation on
> >>>>>> text parsing?
> >>>>>>
> >>>>>> I want to take a bunch of text files (rtf), read them in and dump the
> >>>>>> contents in a database. The files are effectively a flat file database,
> >>>>>> with I suspect some fairly intricate programming needed to process the
> >>>>>> files. Unfortunately, they are laid out for human readability, not data
> >>>>>> conversion.
> >>>>> A few questions.
> >>>>>
> >>>>> How many is a "bunch"?
> >>>>> What would the target database be -- MySQL?
> >>>>> What table and column structures do you envision?
> >>>>> Perhaps simply a single table with two columns:
> >>>>> filename (key) and a memo field containing the data?
> >>>>> What is the purpose behind doing this?
> >>>>>
> >>>> A few answers
> >>>>
> >>>> A bunch is about a dozen. Basically one large file that was broken into
> >>>> sixteen subsets, following the initial letter for each record.
> >>>>
> >>>> The target database would be MySQL
> >>>>
> >>>> I haven't looked too closely at the data, but I think one main table
> >>>> with a few linked tables for those cases where there may be more than
> >>>> one piece of data for a category. There are about 25 categories to each
> >>>> record. Eventually there would be additional structure added around the
> >>>> imported data, but that isn't relevant to importing the data itself. (I
> >>>> will confirm this before beginning to code.)
> >>>>
> >>>> The purpose: I am a D&D fan and I run games. I would like to be able to
> >>>> reference the material and automate much of the process so I don't have
> >>>> to lug and reference 20lbs of books.
> >>> Any chance the RTF files are online so I could look at them?
> >>>
> >>> Perhaps http://www.wizards.com/default.asp?x=d20/article/srd35?
> >>> http://www.wizards.com/d20/files/v35/SRD.zip contains 88 RTF files.
> >>>
> >>>
> >>> Also, I gather, this might be a one-time effort; correct?
> >>>
> >>> Not what you requested but ...
> >>>
> >>> I've developed a VBScript solution that takes the following approach:
> >>> for a given folder, each RTF file is opened in MS-Word and saved
> >>> as a text file which is opened and read then saved in an MS-Access
> >>> database table containing 3 columns: id (AutoNumber), file, data.
> >>>
> >>> Using those 86 RTF files it created a 10MB MS-Access database.
> >>>
> >> Yes, they are online. Yes, you can look at them. Yes, those are the
> >> files except I only care about the 16 monster files. Yes, this is a
> >> one time effort.
> >>
> >> My goal is to create an encounter generation program - where I key in
> >> climate, geography, season, encounter level, time of day, proximity to
> >> civilization, and the application gives me a suggested random encounter
> >> suited to the scenario. For example, if the party was wandering around
> >> the city sewers on a hot summer night, they might encounter a pack of
> >> giant rats being led by a wererat. I would then want the program to
> >> determine how many rats, how many hit points each, and any other
> >> pertinent variable data, including what weapons and treasure the
> >> wererat was carrying and using.
> >>
> >> Having the rtfs loaded into a database like your script does would
> >> enable faster searches, but it would not go the next step and perform
> >> the various calculations based on the results of the searches. It is a
> >> good start, but if it has stripped any of the rtf encoding, it may make
> >> it harder to have a script find the various 'fields'.
> >>
> >> Thanks, Carolyn
> >
> > I counted 17 "Monster" prefixed files.
> >
> > My version creates ".txt" files which do strip "the rtf encoding".
> >
> > An alternative version creates ".htm" files which retain the
> > formatting you want; I don't think you really want all of the
> > "rtf encoding" unless you fully understand the specification:
> > (search on "rtf specification".)
> >
> > Perhaps, as an intermediate step, you would like all of the
> > "Monster" rtfs converted to HTML and made available via
> > an interface to open one or more for viewing.
> >
> > As HTML files they consume 7.5MB.
> >
>
> There are a couple of the monster prefixed files that are not listings
> of monsters but other information, such as monsters as characters.
> Anyway, exact number of files is not overly important.
>
> I just did a little test, and looking at the files, I think the easiest
> to work with may indeed be the text file.
>
> Here is an example to illustrate: I am pulling the monster name, type
> and hit dice from each file format.
>
> in rtf...
> {
> \par }{\fs36
> \par DELVER
> \par }\trowd \trgaph108\trleft-108\trbrdrh\brdrs\brdrw10
> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
> \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1
> \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10
> \cltxlrtb\clftsWidth3\clwWidth4871
> \cellx6840\pard \ql \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20
> }{\b\fs19 \cell }{\fs20 Huge Aberration}{\fs19 \cell }\pard \ql
> \li0\ri0\widctlpar\intbl\aspalpha\aspnum\faauto\adjustright\rin0\lin0
> {\fs19 \trowd \trgaph108\trleft-108\trbrdrh
> \brdrs\brdrw10
> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
> \clvertalt\clbrdrb\brdrs\brdrw10 \cltxlrtb\clftsWidth1
> \cellx1969\clvertalt\clbrdrb\brdrs\brdrw10
> \cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\row }\trowd
> \trgaph108\trleft-108\trbrdrh\brdrs\brdrw10
> \trftsWidth1\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddfr3
> \clvertalt\clbrdrt\brdrs\brdrw10 \clbrdrb\brdrs\brdrw10
> \cltxlrtb\clftsWidth1 \cellx1969\clvertalt\clbrdrt\brdrs\brdrw10
> \clbrdrb\brdrs\brdrw10
> \cltxlrtb\clftsWidth3\clwWidth4871 \cellx6840\pard \ql
> \li0\ri0\nowidctlpar\intbl\faauto\rin0\lin0 {\b\fs20 Hit Dice:}{\b\fs19
> \cell }{\fs20 15d8+78 (145 hp)}{\fs19 \cell }\pard \ql
>
> ----------
> in .html...
> <P STYLE="page-break-after: avoid"><FONT SIZE=5>DARKMANTLE</FONT></P>
> <TABLE WIDTH=410 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=7
> CELLSPACING=0 FRAME=VOID RULES=ROWS>
> <COL WIDTH=124>
> <COL WIDTH=258>
> <TR VALIGN=TOP>
>
> <TD WIDTH=124>
> <P CLASS="western">
> </P>
> </TD>
> <TD WIDTH=258>
> <P CLASS="western"><FONT SIZE=2>Small Magical Beast</FONT></P>
> </TD>
> </TR>
> <TR VALIGN=TOP>
>
> <TD WIDTH=124>
> <P CLASS="western"><FONT SIZE=2><B>Hit Dice:</B></FONT></P>
> </TD>
> <TD WIDTH=258>
> <P CLASS="western"><FONT SIZE=2>1d10+1 (6 hp)</FONT></P>
> </TD>
> </TR>
>
> ---------
> in .txt...
>
> DARKMANTLE
>
> Small Magical Beast
> Hit Dice:
> 1d10+1 (6 hp)
>
>
> --------
>
> So, looking at that and assuming the rest will be similar, the text
> version looks the easiest to deal with. If document styling such as
> 'title', 'heading' and 'subheading' had been used, maybe not, but in
> this case, a new line seems to denote either a field heading or field
> data. There are exceptions of course - particularly when denoting a
> category of monster.
>
> That does bring me a little closer to achieving my goal. Thanks for
> the assistance. :)
>
> Carolyn

So I gather you have what you need.

I'd suggest just manually converting the files (via MS-Word Save-As)
rather than automating that part of the process since it's a one-time
effort and there aren't that many files.
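
For the parsing step itself, here is a rough sketch in JavaScript (the
same language as the page further down). It assumes every record looks
like your DARKMANTLE sample: name on the first non-blank line, size and
type on the second, then heading lines ending in ":" each followed by
one value line. The name parseStatBlock and the shape of the returned
object are my own invention, and the exceptions you mention (such as
monster category entries) would need extra handling.

```javascript
// Sketch only: assumes the ".txt" layout shown in the DARKMANTLE sample.
// parseStatBlock is a hypothetical helper name, not part of any library.
function parseStatBlock(text) {
  // Drop blank lines and surrounding whitespace.
  const lines = text
    .split(/\r?\n/)
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s !== ""; });

  const record = { name: lines[0] || "", type: lines[1] || "", fields: {} };

  for (let i = 2; i < lines.length; i++) {
    // A line ending in ":" is treated as a field heading;
    // the next non-blank line is taken as its value.
    if (lines[i].endsWith(":") && i + 1 < lines.length) {
      record.fields[lines[i].slice(0, -1)] = lines[i + 1];
      i++; // skip the value line just consumed
    }
  }
  return record;
}

const sample = "\nDARKMANTLE\n\nSmall Magical Beast\nHit Dice:\n1d10+1 (6 hp)\n";
console.log(parseStatBlock(sample));
```

Each returned object could then be written to the main table, with the
entries in "fields" mapped to whatever columns you settle on.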

Below is a page that lists the "Monster" files in a dropdown and
displays the selected one in an <iframe>. The <select> is on the
right to allow quicker access to the scroll bar. Save it as
"Monster.htm" in the same folder as the "Monster" files saved as
Web pages (i.e. with a ".htm" extension). Double-click the filename
in Windows Explorer, or create a desktop shortcut to it for quicker
access.

Watch for word-wrap.

<html>
<head>
<title>Monster.htm</title>
<script type="text/javascript">
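function monster(that) {
// "that" is the <select> element passed in by onchange="monster(this)".
var what = that.value;
// The blank and separator entries have an empty value; ignore them.
if (!what) { return; }
document.getElementById("id_picked").innerHTML = what;
document.getElementById("id_iframe").src = what;
}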
</script>
<style type="text/css">
.font { font-family:Arial; font-size:8pt }
.zero { margin:0px; padding:0px }
</style>
</head>
<body class="zero">
<form action="" method="get" class="zero">
<table align="center" border="0" cellpadding="0" cellspacing="0"
width="100">
<tr valign="top">
<th>
<span id="id_picked" class="font"></span><br>
<iframe id="id_iframe" width="860" height="600"></iframe>
</th>
<td>&nbsp;</td>
<td class="font">
&nbsp; &nbsp; &nbsp; <b>Monster Files:</b><br>
<select class="font" size="19" id="id_select" onchange="monster(this)">
<option value="">
<option value="MonstersIntro-A.htm">Monsters Intro-A
<option value="MonstersB-C.htm">Monsters B-C
<option value="MonstersD-De.htm">Monsters D-De
<option value="MonstersDi-Do.htm">Monsters Di-Do
<option value="MonstersDr-Dw.htm">Monsters Dr-Dw
<option value="MonstersE-F.htm">Monsters E-F
<option value="MonstersG.htm">Monsters G
<option value="MonstersH-I.htm">Monsters H-I
<option value="MonstersK-L.htm">Monsters K-L
<option value="MonstersM-N.htm">Monsters M-N
<option value="MonstersO-R.htm">Monsters O-R
<option value="MonstersS.htm">Monsters S
<option value="MonstersT-Z.htm">Monsters T-Z
<option value=""> - - - - - - - - - - - - -
<option value="MonsterFeats.htm">Monster Feats
<option value="MonstersAnimals.htm">Monsters Animals
<option value="MonstersasRaces.htm">Monsters as Races
<option value="MonstersVermin.htm">Monsters Vermin
</select>
</td>
</tr>
</table>
</form>
</body>
</html>
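
Once the "Hit Dice:" values are in hand, the "how many hit points each"
part of your generator is mostly a matter of reading the dice notation.
A minimal sketch, assuming the values always look like
"15d8+78 (145 hp)" (count, "d", sides, optional bonus). The names
parseHitDice and rollHitPoints are hypothetical, and the optional rng
parameter exists only so a roll can be checked with a fixed value.
Fractional Hit Dice, if the files contain any, are not handled.

```javascript
// Sketch only: parseHitDice and rollHitPoints are hypothetical names.
// Parses strings such as "15d8+78 (145 hp)" or "1d10+1 (6 hp)".
function parseHitDice(text) {
  const m = /(\d+)d(\d+)\s*([+-]\s*\d+)?/.exec(text);
  if (!m) { return null; } // notation not recognised
  return {
    count: Number(m[1]),
    sides: Number(m[2]),
    bonus: m[3] ? Number(m[3].replace(/\s+/g, "")) : 0
  };
}

// Rolls each die once and adds the bonus.
// rng must return a number in [0, 1), like Math.random.
function rollHitPoints(dice, rng = Math.random) {
  let total = dice.bonus;
  for (let i = 0; i < dice.count; i++) {
    total += Math.floor(rng() * dice.sides) + 1;
  }
  return total;
}

const delver = parseHitDice("15d8+78 (145 hp)");
console.log(delver, rollHitPoints(delver));
```

The figure in parentheses appears to be the average roll rounded down
(15 x 4.5 + 78 = 145.5), so it can serve as a sanity check on the parse.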

 
