Post 8ccgrHhgSTY

Paul Strack Feb 20, 2017 (18:07)

I've released v0.5.4 of Eldamo.

Now that PE22 is done, I've gone back to working on earlier material. I finished up some leftovers in PE12, and made progress on PE14.

I also updated to the latest version of Glaemscribe, which has better support for the Tengwar Eldamar font that I prefer. I strongly encourage anyone who needs to do tengwar transcription to look at this tool. It is super useful:

https://github.com/BenTalagan/glaemscribe
Eldamo : Home
Eldamo - An Elvish Lexicon. by Paul Strack — v0.5.4 — generated on February 19, 2017 8:24:48 PM PST. This collection of documents is a lexicon of Tolkien's invented languages, particularly his Elvish languages, which are the most detailed. The collection is called a “lexicon” because it is not a ...

Tamas Ferencz Feb 20, 2017 (18:22)

Thank you, +Paul Strack! Eldamo is just getting better by the day.
May I have a feature request: would it be very difficult to add 'part of speech' as a filter to the search page?

Paul Strack Feb 20, 2017 (18:25)

+Tamas Ferencz That shouldn't be too hard. I added it as a feature request for the next version, so I won't forget:

github.com - eldamo

Rick Spell Feb 20, 2017 (21:43)

Alacarna!

Rick Spell Feb 21, 2017 (00:10)

I've been hoping for something to help me with the tengwar, and this is what I need!

Lúthien Merilin Mar 13, 2017 (23:24)

Hello Paul,

I'm working on the relational schema based on the Eldamo data (as I wrote some time ago) and I was wondering about one thing in particular that I can't figure out. It's how the Tengwar as seen on some Word pages is generated from the data set (eldamo.xml) ... take for instance the 'Naltariel' page: eldamo.org/content/words/word-470878163.html

If I look at the corresponding part of the xml file (the first tag):
speech="fem-name"
page-id="470878163">

while the page source looks like this:

g#j1E7T`Vj

I can understand how the transcription from "g#j1E7T`Vj" to the Tengwar as seen on the page happens, but I must admit that I'm fazed by how one arrives at "g#j1E7T`Vj" starting from just "ñ"?
I'm obviously missing something, but what? :)
eldamo.org - Eldamo : Quenya : Naltariel

Paul Strack Mar 14, 2017 (00:47)

+Lúthien Merilin The transcription happens when the page loads using Javascript. I would recommend runtime transcription over storing the transcription in the database because it makes it much easier to support future changes in Elvish fonts.

The actual JS logic takes the word value, replaces characters as appropriate based on the tengwar field (n > ñ, s > th), and then runs it through the glaemscribe transcriber.
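The two steps described above (adjust the spelling according to the tengwar field, then hand the result to the transcriber) can be sketched roughly as follows. This is only an illustration in Python, not the actual Eldamo JavaScript: the function names are hypothetical, the transcriber is passed in as a plain callable rather than a real Glaemscribe call, and replacing only the first matching letter is an assumption.

```python
from typing import Callable, Optional


def _replace_first(word: str, old: str, new: str) -> str:
    """Replace the first case-insensitive match of `old`, keeping capitalisation."""
    idx = word.lower().find(old)
    if idx < 0:
        return word
    replacement = new.capitalize() if word[idx].isupper() else new
    return word[:idx] + replacement + word[idx + len(old):]


def apply_tengwar_hint(word: str, hint: Optional[str]) -> str:
    """Adjust the spelling of `word` before transcription.

    Only the two hints mentioned in the post are handled (n > ñ, s > th).
    Which occurrence gets replaced is an assumption of this sketch.
    """
    if hint == "ñ":
        return _replace_first(word, "n", "ñ")
    if hint == "th":
        return _replace_first(word, "s", "th")
    return word


def transcribe_word(word: str, hint: Optional[str],
                    transcriber: Callable[[str], str]) -> str:
    """Run the adjusted spelling through whatever transcriber is supplied."""
    return transcriber(apply_tengwar_hint(word, hint))


# `str.upper` stands in for the real transcriber purely for demonstration.
print(transcribe_word("Naltariel", "ñ", str.upper))
```

Because the transcription is computed from the word value each time, nothing font-specific needs to be stored alongside the data.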

Lúthien Merilin Mar 14, 2017 (01:23)

+Paul Strack agreed re. not storing transcriptions; I just wondered what the input field for that was - so it's just the word form! Go figure, I half-expected that transcription required several steps and hence some or other in-between pre-compiled form. Eru knows where I got that idea from ...
Thanks!

There are some other things about the schema that puzzle me ... I'm afraid that working with relational data has left my brain somewhat xml-challenged. I'll first gather them all and post them together.

Tamas Ferencz Mar 18, 2017 (15:18)

Paul Strack Mar 18, 2017 (17:47)

+Lúthien Merilin Regarding the data elements in the Eldamo XML schema, bear in mind that you are wading through 8 years of detritus. The model is full of half-finished projects and discarded ideas.

I worked on the XML first, and the schema is an afterthought, not a predefining structure. The schema describes what is present in the XML rather than the ideal data structure. In fact, the first draft of the schema was produced by +G. Hussain Chinoy, not by me (and I just realized I never credited him for it).

I would recommend against trying to digest the entire model at once. You are likely to end up with a lot of garbage. Instead, focus on the data elements of interest to you, and incrementally expand on them.

+Severin Zahler has been working on a similar project to convert Eldamo to a relational model, and I believe the author of elfdict.com - Parf Edhellen ~ Parma Eldaliéva (whose name escapes me) went through a similar process for his Eldamo import.

That said, I'd be happy to try and answer any specific questions you have about the model. I find such discussions very useful, because they usually give me ideas on how to improve the model.

I will try to answer some of your questions above in another post.

Paul Strack Mar 18, 2017 (18:15)

Here is a bit of background information on how words, references and forms inter-relate within the model.

The atomic units within the XML model are "references" (ref elements). A reference denotes a specific thing that J.R.R. Tolkien (and occasionally Christopher Tolkien) wrote down, along with some contextual information such as English glosses written next to it. The model organizes all those individual references into "words" (word elements) that are candidates for an Elvish word at a certain stage in Tolkien's life (Early, Middle or Late).

Some of the references within a word differ from each other, and the most common reason for these variations is differing grammatical expressions. For example, both alda and aldar references appear under the word alda "tree", as the singular and plural grammatical expressions of that word.

As a general rule, the "form" attribute is intended to attach some grammatical meaning to these variations. The aldar reference has a sub-element indicating it is a plural form, while aldaron has a sub-element indicating it is the genitive plural.

These sub-elements can be used for multi-dimensional grammatical analysis. For example, consider this page:

eldamo.org - Eldamo : Quenya : genitive

On this page, I've collected all the references to various Quenya words that have a "genitive" form, so I can examine how the genitive syntax has evolved over various periods in Tolkien's life. At some point I hope to do a detailed write-up on this development, as I did for some of the minor languages like Adûnaic and Ilkorin.

The above describes the idealized structure of the data model. This ideal is not consistently followed. In particular, some of the "words" in the data model are not really words, but are other kinds of entries in the lexicon. As you noted above, this is generally determined by the "speech" attribute of the word.

For speech="grammar", the word is instead an entry discussing a particular bit of Elvish grammar, such as the "genitive" entry above. I also have "word" entries for things like a "phoneme" or "phonetic-rule". Depending on the nature of the word, its form expressions may have differing meanings.

Lúthien Merilin Mar 19, 2017 (14:56)

Gee thanks, +Paul! That's very helpful!

Although I had already noticed that, looking from the data, "Everything is a Word", including the grammar, phonetics etc. pages I had not yet realised that this could imply that the child attributes might acquire another meaning as well. It's supposedly my 'relational data tunnel vision' acting up again!

+Ekin Gören mentioned it while we were discussing it and I found indeed that the 'Speech' attribute provided the context information that I missed. After that, I deleted my original question about the "form" attribute because that was pretty much all answered.

Re. "I would recommend against trying to digest the entire model at once." - thank you, that's a fair warning I guess!

However, it has been our intention from the beginning to create a model that is capable of accommodating even the most complex thinkable entries, plus having some other features (there's a list that we compiled somewhere, but I don't have it handy).
So I do want to put some serious effort in creating the model. The Eldamo data are a great starting point for that. Not to convert it blindly, but because it seems to result from a similar aspiration to be as thorough as possible.

I've often experienced that understanding arises almost as by-product of handling something, and this is no different. Right now I'm looking at a very preliminary ER diagram that arose from a first analysis of the Eldamo schema, and even as I was adding entities, shoving them around and merging them there came these "Aha!-Erlebnisse" as a certain piece of what it's supposed to do suddenly dawned on me.
That's what makes this such a fascinating job ... I really enjoy it, even though it sometimes gives me a headache!

We'd only import the data after that model is settled upon, and we can always do that incrementally, indeed.

I'll share the design as it evolves! I should have a first one sometime later today.


Lúthien Merilin Mar 19, 2017 (15:01)

I'm aware of Severin Zahler's project, we even briefly considered asking him to work together; but his project seems to have a different scope than what we want to achieve. Of course, it will be interesting to compare what we've come up with!

I have not yet discussed this with Leonard of Elfdict.com, and since I never descended into the social labyrinths of Google+ I have no idea how to send him a message (searching the help files suggests something called "Hangouts" but that sounds a lot more ponderous than what I have in mind). Maybe it works like on assembla.com, if I just quote his name here with the + in front of it, like this: +Leonard W. ?
If you read this, - well, the above is mostly self-explanatory, but to summarise: I'm trying to reverse-engineer a relational database schema from the xml-based Eldamo data set (I've done the same years ago with the also xml-based schema of Hiswelokë).
I registered at Elfdict to see how words are contributed; did some searches and looked at the search results display.
I suppose Elfdict uses its own data model - am I right to assume that the form to add a word mirrors that model?
But I am mostly interested in how you went about importing the Eldamo data; and if you considered it when settling on the elfdict data schema?
Thanks!

Paul Strack Mar 19, 2017 (16:12)

+Lúthien Merilin I applaud your effort to be ambitious. If you want to be completely comprehensive, then Eldamo might be a decent place to start as a source for ideas. You probably don't want to replicate everything, but at this point I think Eldamo has most of the kinds of relationships you want to examine.

Some of the stuff in the Eldamo data model has more to do with how data is presented rather than being data itself. In particular, the element appearing in grammar entries has control information used in the Eldamo rendering logic to produce a table of grammatical examples. I am sure there are a few other elements like that as well.

If I were able to start over, I probably would not do the "everything is a word" approach, and would split out grammatical and phonetic entries as some other kind of entity. At a minimum, I'd rename word >> entry, which is a more accurate expression of its function.

There are a number of other design mistakes in the model, the result of the way it evolved over time. That is very natural, of course. Any data modeling project is a series of compromises, and you always discover that your early, simpler models need to be enhanced to meet new requirements.

That's part of the reason I was advocating an incremental approach. You can't really know whether your data model is correct until you start using it, and when you do you invariably find flaws in it.

However, as you suggest, you might be able to jump immediately to a pretty sophisticated model by trying to cram the entire Eldamo data set into it. You definitely won't want to replicate the Eldamo model exactly because (a) XML and relational models are naturally different and (b) there are flaws in the Eldamo model you won't want to reproduce.

Still, I am very interested to see what you come up with. I may steal some of your ideas and migrate them back into Eldamo.

Lúthien Merilin Mar 19, 2017 (22:27)

+Paul Strack
"You probably don't want to replicate everything, but at this point I think Eldamo has most of the kinds of relationships you want to examine"

Indeed! Roman and I have worked occasionally on something like this in the past few years, and there's a post here in Google+ where he listed what he thought was needed. All of that seems to be present in some form.

"Some of the stuff in the Eldamo data model has more to do with how data is presented rather than being data itself. In particular, the appearing in grammar entries has control information used in the Eldamo rendering logic to produce at table of grammatical examples. I am sure there are a few other elements like that as well. "

I noticed that indeed. That sort of information will indeed not be necessary, but at first glance I wasn't completely certain what could be safely omitted - for instance, the element is supposedly necessary. But OTOH, it might be that there is some information contained within those control-entities that will prove useful in the end, so I figured that the best I could do is to include all those in the first iteration of the schema. It is after all easier to delete everything that’s not needed at the last stage, than finding out that too much has been thrown out in the process :)
When a more truly relational ordering of the entities arises, many entities will merge, some will split up and others be deleted in the end, I’m sure.

"If I were able to start over, I probably would not do the "everything is a word" approach, and would split out grammatical and phonetic entries as some other kind of entity. "

:) That is interesting! That was exactly what I did today, though I applied it at the level of the implicit "form" (the v= attribute), splitting it into a "form" and a "grammar" entity.

"At a minimum, I'd rename word >> entry, which is a more accurate expression of its function. "

... indeed!

"There are a number of other design mistakes in the model, the result of the way it evolved over time. That is very natural, of course. Any data modeling project is a series of compromises, and you always discover that your early, simpler models need to be enhanced to meet new requirements. "

As with any endeavour, indeed. Sometimes the opposite also happens, when sudden insight strikes FLASH KABOOM FIZZ that a simple change allows you to merge three hitherto separate entities.

“That's part of the reason I was advocating an incremental approach. You can't really know whether you data model is correct until you start using it, and when you do you invariably find flaws in it. “

Of course, and that’s ok.

“However, as you suggest, you might be able to jump immediately to a pretty sophisticated model by trying to cram the entire Eldamo data set into it. You definitely won't want to replicate the Eldamo model exactly because (a) XML and relational models are naturally different and (b) there are flaws in the Eldamo model you won't want to reproduce.”

What I am trying to get across is that I don’t intend to squeeze the whole of Eldamo into the new schema; what we want to preserve is in any case the richness of the Eldamo schema.
I believe that the best way to do that is to start with a 1=>1 migration as the first iteration: for some reason I’ve always preferred to be able to “work the material myself” - to manipulate it, hold it in my hand, if that makes any sense (not literally of course in this case!) as opposed to pondering the problem in abstraction, from a distance.
It is supposedly related to my preferred learning style. It was even like that in school, I always want to “do it myself”, to experiment, to hold it in my hand; that makes it much easier to understand.
Maybe it is related to a certain way of imagining abstractions.

So no worries, I am by no means planning to guzzle up the whole of Eldamo and generate a monstrous ERD that inherits those eight years of design evolution with all its accumulated fuzziness! Even though the first iteration of the schema that I arrived at today looks monstrous enough: whatever that XML-to-RD generator came up with was still seven times as monstrous (I never bothered to create a graphical representation of it, though it might be fun to do that).

“Still, I am very interested to see what you come up with. I may steal some of your ideas and migrate them back into Eldamo.”

Well, here is that first iteration, with only a minimum of editing, like splitting out the GRAMMAR from the FORM attributes and a couple of other minor things. But it’s only the beginning!


https://plus.google.com/photos/...

Lúthien Merilin Mar 19, 2017 (22:29)

I'm pretty certain that the final model will have only one third of the number of entities.

Paul Strack Mar 19, 2017 (22:50)

It sounds to me like the approach you are taking will work well for you. And I agree you can probably trim out two thirds of the junk. In XML it is very easy to have a proliferation of entity types that make no sense in a more rigorously defined relational model.

In fact I just added a new one yesterday: see-also which is a variation of see-further with slightly different rendering characteristics.

Lúthien Merilin Mar 20, 2017 (19:58)

I wondered if it is always possible to narrow down the type of a given @v attribute based on the element that it appears in? I mean first and foremost whether it is possible to split out the grammatical and phonetical entries.

So far, I've figured out some of it, but I'm not 100% certain because most of that is based on looking at some randomly sampled values.
Could you indicate if the below is correct?


1) under the WORD element, the type of @v is determined by the @speech="" attribute:

grammatical values of @speech=""
grammar
infix

phonetical values of @speech=""
phoneme
phonetics
phonetic-group
phonetic-rule

the rest seem to be word-forms, except that I am uncertain about what these should fit in:
?
text
phrase

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2) @v in the following elements seem to be phonological:
- BEFORE

@v in these elements seem to be grammatical :
- REF

@v in these elements seem to be word-forms:
- CORRECTION
- DERIV
- ELEMENT
- EXAMPLE
- INFLECT
- ORDER-EXAMPLE

... and could the category of @v occurring in these elements be dependent on the parent element?
- CHANGE
- COGNATE

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
3) As used in the elements listed below, @v and @l taken together seem to function as a reference to a WORD element, as in foreign key columns (as is also suggested in the documentation, e.g. "Linked to another word with notes discussing this word, using a word reference (@l + @v)"):

- BEFORE
- COGNATE
- DERIV
- ELEMENT
- RELATED
- SEE
- SEE-FURTHER
- SEE-NOTES

> Question: does this then mean, that establishing a link to another WORD element is the only function of the @v (and @l) element, or said otherwise: does this mean, that for every combination of @v and @l as they occur in these elements, there exists a WORD element that has this same combination of @v and @l?

> this would then also mean that @v and @l uniquely identify a WORD element

_note: this would imply that I do not need to process the forms found there other than to match them as foreign key values. In the relational model they can then be replaced by, for instance, entry-id _

... and now for the complexities ;) ....
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
4) @rule, @form in RULE and RULE-EXAMPLE; and @to in RULE-START seem all to contain phonological forms.
However, @to as found in RULE-EXAMPLE also seems to contain regular word-forms.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
5) In element CLASS, the @form attribute always contains a link to a grammatical form

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
6) in element DERIV, @form is always grammatical; @i1, @i2 and @i3 are word-forms, as @v, see above -> 2

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
7) in element ELEMENT, @form is a grammatical form and @v is a word form, see above -> 2
Incidentally, the documented @variant attribute never occurs in the data

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
8) likewise, in element INFLECT, @form is a grammatical form, and so is @variant. @v is a word form as well, see above -> 2.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
9) In the REF element, @from is a phonetic form, just like @v , see above -> 2.


I hope I've covered them all now :)

Lúthien Merilin Mar 20, 2017 (21:28)

I just came across this REF element for page-id="4164672875"


which seems to contradict the documentation: "Reference to an attested form (@v) in Tolkien’s writing (...)"

-> is this intentional or an error?

[edit] on second thoughts; it could be that it is also determined by the parent WORD's speech="phonetic-rule" type

Paul Strack Mar 21, 2017 (01:57)

+Lúthien Merilin The "phonetics" references are mostly placeholders to partially evaluated phonetics-rules. The phonetics analysis takes a very long time, so I tend to leave markers for things that need further evaluation later.

grammatical values of @speech=""
grammar
infix

The infix is a word entry, like suffix and prefix.

the rest seem to be word-forms, except that I am uncertain about what these should fit in:
?
text
phrase

speech="?" are generally words whose part of speech is unclear (generally unglossed words).

speech="phrase" are entries for sentences or phrases, while speech="text" are entries for blocks of text or poems, like Namárië.

I think I see your problem with the @v entries in other elements. I will write them up separately in another post.
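The distinctions above can be summarised as a small lookup, which may be handy when splitting entries into separate relational tables. This is a sketch only: the category names on the right-hand side are hypothetical labels, and the list of speech values is limited to the ones mentioned in this thread.

```python
# Speech values named in the thread, grouped by the kind of entry they mark.
GRAMMAR_SPEECH = {"grammar"}
PHONETIC_SPEECH = {"phoneme", "phonetics", "phonetic-group", "phonetic-rule"}
TEXT_SPEECH = {"phrase", "text"}


def entry_kind(speech: str) -> str:
    """Classify a word entry by its @speech attribute.

    Returns one of the hypothetical labels 'grammar', 'phonetic', 'text',
    'unclear' or 'word'. Anything not listed above (including 'infix',
    which is a word entry like 'suffix' and 'prefix') counts as 'word'.
    """
    if speech in GRAMMAR_SPEECH:
        return "grammar"
    if speech in PHONETIC_SPEECH:
        return "phonetic"
    if speech in TEXT_SPEECH:
        return "text"
    if speech == "?":
        return "unclear"  # part of speech unknown, generally unglossed words
    return "word"


for value in ["grammar", "infix", "phonetic-rule", "phrase", "?", "fem-name"]:
    print(value, "->", entry_kind(value))
```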

Paul Strack Mar 21, 2017 (02:21)

The @v attribute indicates a word or reference value only in the case of <word> and <ref> elements. To put it another way, only a word or a reference can genuinely have a value.

The @v attribute in other contexts constitutes part of a foreign key link to another entity.

If the XML element is a child of a <word> element, this FK link is to another <word> (with the sole exception of <ref> children). For elements that are children of <ref> elements, the link is to another <ref>.

The FK link to a <ref> is a joint-key composed of the @v and @source attributes. Strictly speaking, only the @source attribute is really needed, but the code also checks the @v attribute to help prevent me from screwing up the model as I manually edit references.

The FK link to a <word> is a joint-key composed of the @l and @v attributes. Both are needed, because the same word value may appear in multiple languages.

Consider these three word entries as examples:
















In the above, Q. alda has a word-link to S. galadh, via the child <cognate> element, asserting that it is the cognate of the Sindarin word.

Furthermore, the Q. alda reference at Let/426.3504 has a ref-link to PE. galadā at Let/426.3310 via the child element <deriv>, which asserts that Q. alda is a derivative of PE. galadā, and furthermore this derivative relationship is directly indicated on Let/426.

Once you recognize that @v attributes on elements other than <word> and <ref> are part of FK-links on various relationship elements, I think the model will be easier to understand. Furthermore, most of the same types of relationships can appear under both <word> and <ref> elements, the only variation being whether it is a word-link or a ref-link.

Note that the FK-links are not explicitly described in the XML schema, because XSD is ridiculously slow at parsing foreign key relationships, so I evaluate the FK-links using XQuery logic in the rendering engine.
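For anyone converting this to a relational model, the two kinds of link can be checked with a few lines of code. The sketch below is in Python rather than XQuery and uses a tiny hand-written XML fragment: the element and attribute names follow the discussion in this thread, but the fragment itself (including the root element name, the language code "p" and the speech values) is an illustrative reconstruction, not a copy of eldamo.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature data set modelled on the alda / galadh example.
SAMPLE = """
<word-data>
  <word l="q" v="alda" speech="n">
    <cognate l="s" v="galadh"/>
    <ref v="alda" source="Let/426.3504">
      <deriv v="galadā" source="Let/426.3310"/>
    </ref>
  </word>
  <word l="s" v="galadh" speech="n"/>
  <word l="p" v="galadā" speech="n">
    <ref v="galadā" source="Let/426.3310"/>
  </word>
</word-data>
"""


def resolve_links(root):
    """Return (word_links, ref_links), each entry ending in a 'target found' flag."""
    # Word-links are keyed on (@l, @v); ref-links on (@source, @v).
    words = {(w.get("l"), w.get("v")) for w in root.iter("word")}
    refs = {(r.get("source"), r.get("v")) for r in root.iter("ref")}

    word_links, ref_links = [], []
    for word in root.iter("word"):
        for child in word:
            if child.tag == "ref":
                # Children of a <ref> link to another <ref>.
                for link in child:
                    key = (link.get("source"), link.get("v"))
                    ref_links.append((child.get("source"), link.tag, key, key in refs))
            else:
                # Other children of a <word> link to another <word>.
                key = (child.get("l"), child.get("v"))
                word_links.append((word.get("v"), child.tag, key, key in words))
    return word_links, ref_links


word_links, ref_links = resolve_links(ET.fromstring(SAMPLE))
print(word_links)
print(ref_links)
```

Any entry whose final flag is False would be a dangling link, which is the relational equivalent of a foreign key violation.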

Paul Strack Mar 21, 2017 (02:25)

I need to write up another entry on how phonetic rules interrelate, but that may need to wait until another day. Also, reading back through your notes it seems you were already half-way to figuring out the above, but hopefully my previous post will add some clarity.

Severin Zahler Mar 21, 2017 (12:05)

Hey, and sorry I hadn't noticed the discussion here until now :D It's very interesting to see a somewhat more fundamental discussion about whether or how to work with Eldamo, and also a very interesting approach, +Lúthien Merilin!

While eventually my goal is to not leave behind any valuable data or the work that was invested in gathering it, I did rather start like +Paul Strack proposed initially: to not tackle the entire Eldamo at once, but rather selectively start with the most important elements and attributes and expand the complexity gradually.

My personal goal is to have a database which can be used for efficient translations and should be user-friendly and simple enough for the average LotR fan, but also provide all the options and information the more advanced translators would be happy to use and browse.
I most definitely will also include any neologisms I deem trustworthy (although of course labelling them accordingly to be able to filter them out when querying the database).

The way I started creating the ERD was by asking the question "what information do I want to gain from the database?" and then modelling that. Only then did I start studying Eldamo more closely; Paul's documentation was very helpful for that: eldamo.org - XML Schema Documentation
From there on I worked on expanding the model in order to be able to house all the data Eldamo contains. I only took the structure into account where it contained a bit of data I wanted to preserve.

Over the past couple of weeks I was only indirectly working on my conlang DB project, i.e. I was indulging in the (no longer) mysterious worlds of PHP and JavaScript, and refreshed my knowledge of Bootstrap, HTML and CSS.

To get back into the fun of the data export I just today started writing an extensive documentation about my export progress. It's still heavily WIP and by far not everything I've treated so far is in it, I'm going by the order I have it in my Java code, and so far I've only covered about 500 of the 1000 lines of code :P If you want to check it out already, follow this link: https://docs.google.com/document/d/1CY-ys_C4i5UCGkBpqKHFfydlRzraEKYDlS6qG77GEgU/edit?usp=sharing

Regarding some of the concrete points you discussed above:

I did also consider splitting the elements into actual entries you'd find in a dictionary, and splitting off things like grammar and phonology. But ultimately I could not find any technical advantage in doing so; the only real change is that the ERM gets more complicated and the export requires distinguishing even more things. The only advantage I could think of is that the queries would get a bit easier if you only want to search for the "dictionary" words, but as I plan on making a nice and flexible frontend anyway, this is nothing to worry about for the user, and the work to assemble the queries is most likely smaller than the work to extend the ERM for this.

As was suggested elsewhere, I did not make a separate entity for each of the many relation-type elements (change, correction, relation, deriv...) but instead a generic relation table which houses the foreign keys of up to two words from either the word table or the table of the <ref> elements ("attested_record" in my case). I even went on to treat the glosses as just a sort of relation and to connect the glosses to the words via this generic relation table. For things like intermediary forms I added another field for additional information on the relation.
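A generic relation table of this kind could look roughly like the sketch below. It uses SQLite from Python so that it runs without a server; apart from "attested_record", every table and column name here is hypothetical and not taken from the actual ERD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (
    id   INTEGER PRIMARY KEY,
    lang TEXT NOT NULL,
    form TEXT NOT NULL,
    UNIQUE (lang, form)
);
CREATE TABLE attested_record (
    id      INTEGER PRIMARY KEY,
    word_id INTEGER NOT NULL REFERENCES word(id),
    form    TEXT NOT NULL,
    source  TEXT NOT NULL
);
-- One table for every relation type; each end points at either a word
-- or an attested record, and `info` holds extras such as intermediary forms.
CREATE TABLE relation (
    id       INTEGER PRIMARY KEY,
    type     TEXT NOT NULL,
    word_a   INTEGER REFERENCES word(id),
    word_b   INTEGER REFERENCES word(id),
    record_a INTEGER REFERENCES attested_record(id),
    record_b INTEGER REFERENCES attested_record(id),
    info     TEXT
);
""")

alda = conn.execute(
    "INSERT INTO word (lang, form) VALUES ('q', 'alda')").lastrowid
galadh = conn.execute(
    "INSERT INTO word (lang, form) VALUES ('s', 'galadh')").lastrowid
conn.execute(
    "INSERT INTO relation (type, word_a, word_b) VALUES ('cognate', ?, ?)",
    (alda, galadh),
)

rows = conn.execute("""
    SELECT a.form, r.type, b.form
    FROM relation r
    JOIN word a ON a.id = r.word_a
    JOIN word b ON b.id = r.word_b
""").fetchall()
print(rows)
```

The trade-off is the usual one for generic tables: fewer entities in the model, at the cost of nullable foreign key columns and a type column that the queries have to filter on.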

My current ERD is looking like this:
http://i.imgur.com/qbtGJjk.png


Lúthien Merilin Mar 21, 2017 (23:12)

Today's state of the model! I already normalised a good many tables out, but some more to go ...
https://plus.google.com/photos/...

Lúthien Merilin Mar 21, 2017 (23:13)

hi +Paul Strack and +Severin Zahler thanks for your comments; I will answer tomorrow!

Lúthien Merilin Mar 21, 2017 (23:17)

+Severin Zahler - just curious, why those fifty VARCHAR(45) columns in the conning_word_inflection table to the left?

Paul Strack Mar 22, 2017 (00:50)

Here is the promised discussion of the phonetic-rule references. This is a quite obscure area of the Eldamo model: my attempt to analyze the phonetic changes of Elvish words from their primitive to modern forms. This analysis is only "complete" in the case of Adûnaic, Ilkorin and Danian. Everywhere else it is a work-in-progress, and likely contains quite a few errors. Even in the case of the "complete" languages, my analysis is frequently guesswork, and should be used with caution. Furthermore, the XML structures are far from ideal: I stopped working on the design once I got something that worked for my purposes.

The atomic elements in the phonetic rules are the <rule> elements, which are always children of a <word speech="phonetic-rule">, which I will call a phonetic-rule below for simplicity.

Each rule element has three attributes: @l (language), @rule (resulting phoneme) and @from (original phoneme). For example:



These three attributes together compose a triple key, and as such can be used as rule-link foreign keys to each rule. More on that in a bit.

Note that a specific phonetic-rule may contain multiple related <rule> elements:






If there is more than one <rule> element, then the phonetic-rule itself contains a similar triple defining the "combined" rule. In the above example, this would be:



There are no direct links to the combined rules, but in some hypothetical future I plan to build an engine that applies the combined rules to various attested phonetic developments to verify the correctness of my phonetic analysis.

Paul Strack Mar 22, 2017 (00:51)

There are two possible links to a <rule> element: (1) a phoneme change and (2) a <rule-example>.

A phoneme change is always a <ref> element attached to a specific phoneme or phoneme cluster:







The phoneme change rule-link uses a @rl attribute for the rule language instead of @l, because a ref can (in theory) have a distinct @l attribute: the language of the phoneme itself. The phoneme change is for actual attested phoneme derivations expressing a particular phonetic rule.
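The triple key and the @rl variant of the rule-link can be illustrated with a short sketch. The helper names are hypothetical, and the sample attribute values are borrowed from the Ilkorin/Beleriandric example that appears later in this thread rather than copied from eldamo.xml.

```python
from typing import Dict, Iterable, Mapping, NamedTuple


class RuleKey(NamedTuple):
    lang: str    # @l on a rule, @rl on a phoneme-change ref
    rule: str    # @rule: the resulting phoneme or pattern
    origin: str  # @from: the original phoneme or pattern


def key_of(attrs: Mapping[str, str], lang_attr: str = "l") -> RuleKey:
    """Build the triple key from an element's attributes."""
    return RuleKey(attrs[lang_attr], attrs["rule"], attrs["from"])


def index_rules(rules: Iterable[Mapping[str, str]]) -> Dict[RuleKey, Mapping[str, str]]:
    """Index rule elements by their triple key, rejecting duplicates."""
    index: Dict[RuleKey, Mapping[str, str]] = {}
    for attrs in rules:
        key = key_of(attrs)
        if key in index:
            raise ValueError(f"duplicate rule key: {key}")
        index[key] = attrs
    return index


rules = [
    {"l": "ilk", "rule": "V\u0304p", "from": "Vhp"},
    {"l": "ilk", "rule": "V\u0304t", "from": "Vht"},
    {"l": "ilk", "rule": "V\u0304k", "from": "Vhk"},
]
index = index_rules(rules)

# A phoneme-change ref: its own language is @l, the rule's language is @rl.
phoneme_change = {"l": "bel", "rl": "ilk", "rule": "V\u0304t", "from": "Vht"}
print(key_of(phoneme_change, lang_attr="rl") in index)
```

Keying the lookup on @rl rather than @l is what lets a reference in one language support a rule filed under another.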

The <rule-example> rule-link is an example of a particular <rule> in the phonetic evolution of an Elvish word:










The triple key is in effect a rule-link to the original <rule> example above. The @to attributes indicate the various stages the word moves through during its phonetic evolution. The attribute name @to is an artifact of an earlier abandoned approach, and is not a very good choice. I should probably change it to @stage. In fact, I think I'm going to do that with the next release of Eldamo.

Continuing the example above, the stages of phonetic change indicated by the @to attributes are baltār > baltōr > balθōr > balθor. The <rule-start> never has a linked rule, but each subsequent <rule-example> has a rule-link. The only exception is when I have not yet identified a general rule for a particular phonetic change, in which case I use rule="?". For example:










The first change gālæ > gāla doesn't have a supporting phonetic <rule>, either because I haven't finished analyzing it or because I simply can't figure out what Tolkien was trying to do.
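The way the @to stages chain together can be sketched as follows. The function names are hypothetical, and the rule labels in the first example are placeholders describing the visible change, since the actual @rule/@from values are not reproduced in this thread.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, str, Optional[str]]  # (stage before, stage after, rule or None)


def derivation_steps(start: str, examples: List[Dict[str, str]]) -> List[Step]:
    """Pair each stage with the one before it.

    `start` is the @to of the rule-start (which has no linked rule);
    each example carries a @to stage and a rule label, with "?" meaning
    that no general rule has been identified yet.
    """
    steps: List[Step] = []
    previous = start
    for example in examples:
        rule = None if example["rule"] == "?" else example["rule"]
        steps.append((previous, example["to"], rule))
        previous = example["to"]
    return steps


def format_chain(start: str, examples: List[Dict[str, str]]) -> str:
    """Render the stages in the 'a > b > c' notation used in the post."""
    return " > ".join([start] + [example["to"] for example in examples])


# Placeholder rule labels; only the stages themselves come from the post.
baltar = [
    {"to": "baltōr", "rule": "placeholder-1"},
    {"to": "balθōr", "rule": "placeholder-2"},
    {"to": "balθor", "rule": "placeholder-3"},
]
print(format_chain("baltār", baltar))

# The unexplained first change from the second example.
print(derivation_steps("gālæ", [{"to": "gāla", "rule": "?"}]))
```

Steps whose rule comes back as None are the ones still waiting for analysis.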

Getting back to the phonetic-rules themselves, there is some additional information I use to assess the order of phonetic changes: @order attributes and <before> elements.









The @order attribute of the phonetic-rule determines the order in which phonetic-rules are applied. The <before> element asserts that a particular phonetic change must have occurred before another change, and the <order-example> is a ref-link to a reference whose phonetic development supports this assertion. I have code that compares the @order attributes, the <before> assertions and the various specific <rule-example> elements to verify that my analysis of the ordering of phonetic changes is consistent.
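A consistency check of this kind might look like the sketch below. It is not the Eldamo code: the rule names and order numbers are invented, and it assumes that a lower @order value means the rule applies earlier.

```python
from typing import Dict, List, Tuple


def order_violations(order: Dict[str, int],
                     before: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every (earlier, later) assertion that the @order values contradict.

    `order` maps a rule name to its @order value; `before` lists assertions
    that the first rule must have occurred before the second.
    """
    violations = []
    for earlier, later in before:
        if order[earlier] >= order[later]:
            violations.append((earlier, later))
    return violations


# Hypothetical rules and assertions, for illustration only.
order = {"rule-a": 100, "rule-b": 200, "rule-c": 150}
before = [("rule-a", "rule-b"), ("rule-b", "rule-c")]

print(order_violations(order, before))
```

Here the second assertion is reported, because rule-b is ordered after rule-c even though it is asserted to come before it.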

Paul Strack Mar 22, 2017 (00:51)

One final detail. In the discussion above, I said that the <rule> elements describe the change of phonemes, but more accurately they describe changes according to phonetic patterns, as described in this page:

http://eldamo.org/general/phonetic-descriptions.html

Consider this example:










The combined phonetic-rule pattern rule="V̄{ptk}" from="Vh{ptk}" indicates that Vh (vowel + h) became V̄ (long vowel) when appearing before any of voiceless stops {ptk}. This pattern is broken down into the constituent elements so I can better track specific examples of the change for each of the consonants, though in this case I've only found rule-examples in the case of rule="V̄t" from="Vht":

























I've also found one discussion of a phoneme change that supports the rule, though the exact phonetic representations that Tolkien used aren't a perfect match:







This example illustrates why I needed to separate @rl from @l: Tolkien described this phoneme change as part of a discussion of Beleriandric (bel), but I am treating this as an example of an Ilkorin (ilk) phonetic change, since Beleriandric is the last iteration of Ilkorin before Tolkien abandoned that language.

So anyway, that's a blizzard of detail, hopefully not too overwhelming. The phonetic stuff is very complicated and quite subjective, which is why it is so poorly documented.
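The breakdown of a combined pattern such as rule="V̄{ptk}" from="Vh{ptk}" into its constituent pairs can be illustrated with a small Python sketch. The function name `expand_pattern` is hypothetical, and the sketch assumes one {...} set per string with single-character members; it does not claim to reproduce how Eldamo itself handles the patterns.

```python
import re
from typing import List, Tuple

SET_PATTERN = re.compile(r"\{([^{}]*)\}")


def expand_pattern(rule: str, source: str) -> List[Tuple[str, str]]:
    """Expand a {...} set in both strings into one (rule, from) pair per member."""
    rule_match = SET_PATTERN.search(rule)
    source_match = SET_PATTERN.search(source)
    if rule_match is None or source_match is None:
        # Nothing to expand: the pattern is already a single concrete pair.
        return [(rule, source)]
    if rule_match.group(1) != source_match.group(1):
        raise ValueError("rule and from must use the same set of phonemes")
    pairs = []
    for member in rule_match.group(1):
        pairs.append(
            (
                SET_PATTERN.sub(member, rule, count=1),
                SET_PATTERN.sub(member, source, count=1),
            )
        )
    return pairs


pairs = expand_pattern("V̄{ptk}", "Vh{ptk}")
for rule, source in pairs:
    print(source, ">", rule)
```

The middle pair is the rule="V̄t" from="Vht" case for which examples were found.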

Severin Zahler Mar 22, 2017 (08:14)

+Lúthien Merilin This is a generic inflection table. Initially I had a table for every sort of inflection, i.e. Quenya nouns or Sindarin verbs, but my technical "coach" suggested only having one table with generic column names (one for each inflection, i.e. col1 may be Nominative Singular if you inflect a noun or 1. Person Sg. if you inflect a verb) and labelling the column using MySQL VIEW.

The data for this table I will not get from eldamo, although I will export the elements separately to have a confirmed stock of information to start from. The current idea is that I only fill in those inflections that have any sort of exception; those that follow specific formation rules I want to generate programmatically at runtime. The "has_inflection" field in the conlang table serves as a flag for whether the inflections are found in that inflection table or not.

The number of 50 columns doesn't represent a special value, it just should be high enough to fit all inflections of a certain type (take Quenya nouns with around 12 cases and each in Sg, Pl, Partitive and Dual yields already 48 forms). The string length of 45 is just the default of MySQL Workbench and I will most likely increase that still.

Lúthien Merilin Mar 23, 2017 (00:14)

+Paul Strack - thanks a ton, this is all VERY helpful!

Once you recognize that @v attributes on elements other than and are part of FK-links on various relationship elements, I think the model will be easier to understand. Furthermore, most of the same types of relationships can appear under both and attributes, the only variation being whether it is a word-link or a ref-link.

You know, I had a hunch about that, also because it is suggested in the documentation page. What made me slow to realise it in full is, again, the habit of thinking about FK's as containing key values - ie. not real data; the notion of those "v=" attributes somehow containing "other word forms" kept getting in the way, as it were :)

I even did some tests to manually find correspondences between RULE and RULE-EXAMPLE elements like that, and indeed found them. All in all it is also an interesting exercise in loosening up some ingrained thinking habits!

Note that the FK-links are not explicitly described in the XML schema, because XSD is ridiculously slow at parsing foreign key relationships, so I evaluate the FK-links using XQuery logic in the rendering engine.

I see, yes. I suppose that given all the information I have now it should be doable to figure things out.
I haven't had time yet today to read everything, but I surely will in the coming days.

Thanks once more, I very much appreciate your help!

Lúthien Merilin Mar 23, 2017 (00:50)

+Severin Zahler

what I meant is: why don't you create another table in a 1:n relationship to table conlang_word_inflection to store all those inflect types in? That'd save a lot of complexity and leave the number of inflect types free - see diagram:
https://plus.google.com/photos/...

Severin Zahler Mar 23, 2017 (08:41)

Hmm had to think a couple minutes about what you mean, but I think I get what the idea is:

So, in the conlang_word_inflection I'd just list all inflected words in one column (I guess the one you labelled "comment"?), basically completely unsorted (i.e. after an inflected Quenya noun the next could be a conjugated Sindarin verb), the inflection_type foreign key would explain what sort of inflection it'd be.

In the Inflection_type table I'd have somewhat the description of every possible inflection, i.e. "1st person plural inclusive" in the "COL" column, and the "COLNUMBER" to specify in what order the inflections should be placed in tables or similar.

I don't entirely understand it yet though: Why do you have the 1:n relation this way around and not the other? How I read it is that "every specific inflected word has many inflection types", but "every inflection type has only one inflected word". It should be the other way around; an inflection type can be used many times, but every inflected word has only one inflection type.
Connected to that it seems strange that you have a foreign key in the inflection_type table and not a primary key that can be referenced from the inflection table.

I agree, technically this would be safer and cleaner, I really consider changing this! It is just incredibly counter-intuitive for me, but as there will anyway be an interface which scrambles up the data quite a bit to make it readable this is not that grave.
One technical issue that does persist though is the following: Some inflection types appear in multiple languages, but as they might have a different count of inflections the COLNUMBER would vary. So it'd be necessary to make every inflection type language-specific (i.e. distinguish between Sindarin 3rd person plural and Quenya 3rd person plural) to circumvent this problem. Easiest may be to even link the conlang table to the inflection_type table and have each inflection type have a foreign key to the language table.

Thanks for the input!

Lúthien Merilin Mar 23, 2017 (14:50)

Well, the only thing is that the INFLECTION_TYPE table contains an arbitrary number of orderable varchar(45) fields that would take the place of all the columns that you have now hard-DDL'd in the CONLANG_WORD_INFLECTION table; it would take just one join to retrieve them all for a particular row of the CONLANG_WORD_INFLECTION table, where the colnumber integer would supply which "column" it's about (of course you would have to first supply that when creating the row in INFLECTION_TYPE).

If I understand you right, your intention is to have some sort of specific meaning in mind for those columns - for instance, containing 1st, 2nd, 3rd person declinations or what have you - and something else in case it is a noun?

It seems to me a bit like a dilemma between, on one hand, making something generic; and on the other, imposing a structure on the data via the model. I'd say that those fifty columns as it stands now don't add any structure (because they don't have any), so I would either make them more meaningful (1st, 2nd, 3rd person, etcetera ... ) - in which case you would lose some flexibility - or normalise them out to another table (as in that little drawing I made). In that last case the data themselves will carry the structure. OK, maybe I should give a small example:

I think using a WORD table in a 1:n relationship with something like WORD_INFLECTION with only one content column, and an additional INFLECTION_TYPE list-of-values table (and possibly a TENSE list-of-values type) would be enough, something like this, using the present tense pronominal inflections for the Sindarin verb pada- :

WORD
[ID] [word]
11 pada-

WORD_TYPE
[ID] [type]
42 a-verb

LANG
[ID] [lang]
3 Sindarin

TENSE
[ID] [tense]
4 present tense

INFLECTION_TYPE
[ID] [lang_ID] [wordtype_ID] [tense_ID]
7 3 42 4

WORD_INFLECTION
[word_ID] [infl_type_ID] [order#] [txt]
11 7 1 padon
11 7 2 padog
11 7 3 padol
11 7 4 padof
11 7 5 padab
11 7 6 padodh
11 7 7 padar

(the first row under the table header is the column name)
Note - to encode more meaning in the model, the order# could also be listed in a separate table, but that would then of course be dependent on the word-sort.
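For anyone who wants to try the example, here is a minimal sketch of the same tables in Python with an in-memory SQLite database. The table and column names follow the listing above, except that order# is renamed to order_no (an assumption made here, since # is not valid in an unquoted SQL identifier); the constraints are likewise only a guess at what a real schema might use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE word (id INTEGER PRIMARY KEY, word TEXT NOT NULL);
    CREATE TABLE word_type (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
    CREATE TABLE lang (id INTEGER PRIMARY KEY, lang TEXT NOT NULL);
    CREATE TABLE tense (id INTEGER PRIMARY KEY, tense TEXT NOT NULL);
    CREATE TABLE inflection_type (
        id INTEGER PRIMARY KEY,
        lang_id INTEGER NOT NULL REFERENCES lang(id),
        wordtype_id INTEGER NOT NULL REFERENCES word_type(id),
        tense_id INTEGER NOT NULL REFERENCES tense(id)
    );
    CREATE TABLE word_inflection (
        word_id INTEGER NOT NULL REFERENCES word(id),
        infl_type_id INTEGER NOT NULL REFERENCES inflection_type(id),
        order_no INTEGER NOT NULL,
        txt TEXT NOT NULL,
        PRIMARY KEY (word_id, infl_type_id, order_no)
    );
    """
)

# The rows from the example above.
conn.execute("INSERT INTO word VALUES (11, 'pada-')")
conn.execute("INSERT INTO word_type VALUES (42, 'a-verb')")
conn.execute("INSERT INTO lang VALUES (3, 'Sindarin')")
conn.execute("INSERT INTO tense VALUES (4, 'present tense')")
conn.execute("INSERT INTO inflection_type VALUES (7, 3, 42, 4)")
inflected = ["padon", "padog", "padol", "padof", "padab", "padodh", "padar"]
conn.executemany(
    "INSERT INTO word_inflection VALUES (11, 7, ?, ?)",
    list(enumerate(inflected, start=1)),
)

# Retrieve every inflected form of one word, in order, with its labels.
rows = conn.execute(
    """
    SELECT wi.order_no, wi.txt, l.lang, t.tense
    FROM word_inflection AS wi
    JOIN inflection_type AS it ON it.id = wi.infl_type_id
    JOIN lang AS l ON l.id = it.lang_id
    JOIN tense AS t ON t.id = it.tense_id
    WHERE wi.word_id = ?
    ORDER BY wi.order_no
    """,
    (11,),
).fetchall()
forms = [row[1] for row in rows]
print(forms)
```

The number of forms per word is not fixed by the schema: a word with twelve forms simply has twelve rows in word_inflection.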

But then again, I'm a bit of a normalisation freak :)

Lúthien Merilin Mar 24, 2017 (00:05)

Lúthien Merilin Mar 24, 2017 (12:39)

This is probably near the final version:

- unified the GRAMMAR / PHONETICS again with the FORM type, since they are completely exchangeable
- generalised NOTES to generic DOCS
- generalised the following types into LINKED & LINKEDTYPE:
CHANGE
COGNATE
CORRECTION
DERIV
ELEMENT
DERIVEXAMPLE
INFLECT
INFLECTEXAMPLE
RELATED
under both ENTRY and REF
- generalised the SEE, SEE-NOTES, SEE-FURTHER into SEE type attribute
- added ordering attribute to all 1:n and n:m relationships to fix ordering; where necessary combined with type (this makes it possible to store a fixed ordering for different types of that particular intersection entity under a common parent entity - ok, in everyday language: even though the above-listed entities are encapsulated by the LINKED type, it remains possible to maintain the given ordering for, say, INFLECT elements of an ENTRY among one another)
- I somewhat rehashed how the RULEs work: they can be chained together with the nextruleexample_id attribute in RULE_EXAMPLE, with the DERIV_RULESEQUENCE as a starting point

I'll probably make some minor changes, but overall I think this is what it converges into.
https://plus.google.com/photos/...

Severin Zahler Mar 24, 2017 (16:06)

Thanks a ton for elaborating on your idea; I did get your idea right it seems, at least mostly!
Generally I am also striving for maximum normalization; here I'd make a small exception though, but instead of talking my mouth fuzzy I'd rather give you the updated ERD and explain from there:
https://plus.google.com/photos/...

Severin Zahler Mar 24, 2017 (16:18)

Alright, bear with me, the table name nomenclature is making this way more confusing than it should be, but after some thinking I think this is a pretty reasonable way of naming stuff.

The conlang_word_inflection table contains a list of all inflected word forms. Obviously it has an FK for the word it belongs to as well as one FK describing what sort of inflection it is. As a small bonus there is a field which can be used to mark whether a specific inflected form is regular or exceptional.

The inflection table describes one very specific inflection. Think of an empty inflection table, then each field's content would be characterised by an "inflection". Such an inflection could i.e. be "Aorist 1st Person Plural inclusive" or "Nominative Dual". I know, that's not normalized and nicely atomic, but I do not see any gain in splitting this up into another heap of tables;
The order I did put by the inflection and not the specific inflected word, again imagine the inflection table, no matter what i.e. noun you decline or verb you conjugate, the tenses and persons resp. cases will be the same. (Exceptions like impersonal verbs can be solved otherwise, i.e. using that "exceptional" field).
I did not link the language to the inflection table because it would be redundant. It's already linked to the word table, which ultimately is linked to the inflection table.

The inflection_type lastly contains the things like "a-verb". It's connected to the word-table as it is word-specific, and again, the i.e. persons and tenses are not affected by what inflectional type a word has.


Thanks again for kicking off the discussion on this, the longer the more I see how silly the previous version was :P

Lúthien Merilin Mar 28, 2017 (00:43)

Some corrections, as I realised that what I considered to be Form-Type (e.g. the Word->Speech attribute, but also Element->Form) turns out to be linked rather to the parent Word or Element element than to the form.

(this reminds me of that time that it first dawned on me that nothing will stop you from going on normalising a model until you end up with just one huge table with maybe four or five columns (id, value, type, parent_id) where all semantics are implicitly contained within the data and the parent-child relations between the rows. It would be very efficient, but completely human-unreadable, requiring honking large views to unravel the data to something intelligible. A slightly scary but still fascinating thought ... ;) )
https://plus.google.com/photos/...

Lúthien Merilin Mar 31, 2017 (17:15)

+Severin Zahler sorry, I seem to have missed your answer! I'll answer asap.

Lúthien Merilin Mar 31, 2017 (17:30)

+Paul Strack - I wonder if there's yet an easy way to distinguish which forms can be considered equal in the case of differences in initial capitalisation and accented vowels, as in:

turambar
Turambar

or

Eonwe
Eonwë

... at times leading to quadruple versions like:

Manwë
manwë
Manwe
manwe

just making up those examples here, so they might not occur for real. But there are many near-duplicates of that sort in "the list of all forms" that results from scanning the form-like attributes (v, form, from, etc.) of eldamo.xml and removing the duplicate ones

Some of those seem quite straightforward, like with Manwë - proper names etc can be determined by the speech attribute.

In some other cases the differences in accents are clearly illustrating phonological changes and are meaningful. But not always.

So I was wondering if you might know something to look at - for instance specific speech types, or languages, where those differences can be ignored, or maybe where they might actually be meaningful?

Thanks!

Paul Strack Mar 31, 2017 (17:47)

+Lúthien Merilin Which forms are and are not "the same" is unfortunately quite subjective. For the most part you can normalize and ignore case and variants like ë, but not always. For example Voronwë (a name) is distinct from voronwë (a word).

That's the whole point of Eldamo's organization of references within words. A reference is a specific thing that Tolkien wrote, exactly as he wrote it. A word is my assertion that a set of references are all variations on the "same" word. The organization of references into words is just my opinion, really.
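As a rough illustration of the "normalize and ignore case and variants like ë" idea, the following Python sketch groups forms that differ only in capitalisation or diaeresis. It is only a sketch: the function names are hypothetical, the sample forms are illustrative rather than taken from eldamo.xml, and the groups it produces are candidates for manual review, not automatic merges, since pairs like Voronwë and voronwë are meant to stay distinct. Only the diaeresis is stripped; macrons and other accents are kept because they can mark real phonological differences.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List

COMBINING_DIAERESIS = "\u0308"


def loose_key(form: str) -> str:
    """Lower-case a form and drop diaeresis marks, keeping all other accents."""
    decomposed = unicodedata.normalize("NFD", form)
    stripped = decomposed.replace(COMBINING_DIAERESIS, "")
    return unicodedata.normalize("NFC", stripped).lower()


def group_candidates(forms: Iterable[str]) -> Dict[str, List[str]]:
    """Group forms sharing a loose key; keep only groups with several variants."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for form in forms:
        groups[loose_key(form)].append(form)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


forms = ["Manwë", "manwë", "Manwe", "manwe", "Voronwë", "voronwë", "baltār"]
candidates = group_candidates(forms)
for key, variants in candidates.items():
    print(key, variants)
```

The original spellings are kept in each group, so nothing Tolkien wrote is lost; a reviewer can then decide per group whether the variants really belong together.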

Marc Barceló Tost Mar 31, 2017 (20:18)

Hi +Paul Strack! First of all, thank you for your amaaazing work! I've been a quenya researcher since high school and your work has helped me a lot!
I have a doubt I'd like to question you about: I have a personal interest for the verb hosta- (collect, gather). Through your website I discovered it comes from *KHOTH. My doubt concerns the tengwar writing, since you wrote it with a normal "s", but in fact my guessing is it should be written with the "thule/sule" letter, right?
I don't know if there's an actual writing example from Tolkien...
I'd very much like to solve this :)
Thank you for your time! :)

Paul Strack Apr 01, 2017 (02:13)

+Marc Barceló Tost That's a good question. I haven't looked at the etymology of hosta specifically. However, based on its cognates, I don't think it would be spelled with thule. See khotsē:

http://eldamo.org/content/words/word-474284331.html

This is the primitive precursor to N. hoth "host", and likely developed from khotʰsē. The aspirated [tʰ] would have lost its aspiration before [s] very early, well before it could develop into [θ]. My guess is that hosta would have developed from a similar primitive form khotʰsā.

Furthermore, it seems that in Tolkien's later writings he replaced the root KHOTH with KHOT:

eldamo.org - Eldamo : Primitive Elvish : KHOT

So, I think the spelling of hosta with silme is correct.

Lúthien Merilin Apr 01, 2017 (11:54)

+Paul Strack - thanks, I understand.
I'd be happy to leave all the variations as they are, it's just that I'm afraid that it will result in disjointed entries. You see, the problem is that in XML the position of a certain element is meaningful by itself, whether or not the key attributes match completely. In a relational model, however, there is no such thing as position and we have to rely on the attribute values only.
This might somewhat diminish the consistency in the end, though it can always later be amended by hand. Ah well :)

Lúthien Merilin Apr 02, 2017 (17:40)

+Paul Strack I'm sorry to bother you again, but I have one more question (I'm starting to feel like that inspector Columbo ;) ) ..

Is there an essential difference between phonetic rules as element (see http://eldamo.org/general/eldamo-schema.html#element_rule) and phonetic rules as contained within the Word element (see http://eldamo.org/general/eldamo-schema.html#element_word - "these elements also represent other lexicon entries such as phonetic rules or grammatical entries"), and if yes, how would you describe that?

It's as if Word entries with speech attribute = "phonetic-rule" are like parent entities to rules of the Rule-as-element type that can also contain documentation-elements like Notes; whereas Word entries with speech attributes phonetic-group, phonetics or phoneme seem to contain the building blocks of overviews of phonetics (e.g. tables, lists, etcetera).
Is that correct or am I way off?
eldamo.org - XML Schema Documentation

Paul Strack Apr 02, 2017 (18:16)

Yes, there are differences. Let's look at the Ilkorin Phonetics page as an example of how the data model is displayed:

eldamo.org - Eldamo : Ilkorin/Doriathrin Phonetics

The "phoneme" entries are listed at the top. They appear within the Phonemes tables of each language. The phonemes also appear as elements within different phonetic-groups. These groups control how the Phonemes tables are organized.

For example, consider these two Ilkorin phonetic-groups:





The voiced-stops make up the 2nd row of the phonetic table (phone-row="2"), while the k-series is the 3rd column (phone-col="3").

Similarly, the "vowels", "long-vowels" and "diphthongs" groups determine those rows, while everything left over gets dumped into the "others" section.

There may also be other groups not reflected in the tables. For example the "final-consonants" group in Adûnaic describes which consonants are allowed to be final.

http://eldamo.org/content/words/word-2391800373.html

An even better example would be a "final-consonants" group for Quenya, but embarrassingly enough I haven't defined it, yet.

Not every language has this level of analysis: look at the mess in Telerin, where I haven't even begun to look at phonetics:

http://eldamo.org/content/phonetic-indexes/phonetics-t.html

The "phonetic-rule" word entries are for describing a specific phonetic change within the history of the languages. I discuss it and the other related XML elements in more detail above.

Finally, the "phonetics" word entries are for items that don't fit neatly into the "phoneme", "phonetic-group" and "phonetic-rule" categories. The vast majority of these entries are placeholders for phonetic information that I haven't analyzed yet. The languages with complete phonetic analysis, like Adûnaic and Ilkorin, don't have any "phonetics" entries left over, because I've converted all that information into other entries.

Lúthien Merilin Apr 02, 2017 (23:24)

Thank you! I'm amazed at the depth of your effort.

You know, I don't know how you work on the Eldamo data set; but especially if I can unravel the word entities into phonological / grammatical / lexical structures in a sensible manner, it should be possible to create editors / interfaces for the phonological and grammatical data as well.

Until now we've looked at this as a lexical project with the phonological & grammatical data as references and resources, but if you think it would be useful, why not broaden the scope?

I would presumably need some input for that though. I'm not a linguist, I just happen to love (especially) the Sindarin language. It's a user perspective, not a scholarly one; I think, for instance: "what would be just what I needed if I want to write a poem in Sindarin?" - I have no idea what would be needed to enter phonological changes or what have you. I mean that in a very practical sense, as when specifying the functionality of an administrative system that talks to a database: what are the entities you work on, where can you use values from other entities, how can you link them up - can you add free-form data, and where? - etcetera - like designing an online form or whatever.

We'll need something like that in any case to add lexical entries by hand.

Since I am trying to keep the richness of eldamo.xml fully intact, the relational model should in principle be able to capture anything that the XML schema can capture. But as you also said, the XML schema has grown organically over some time and it might be that it is not optimal.

So, if you are interested, it may be worthwhile to also look at the final model with this in mind: do you maybe have any ideas how you would want to improve on the Eldamo data model? It would be great if I could take those into account.

- luthien




Lúthien Merilin Apr 02, 2017 (23:36)

This is indeed a great illustration, addressing exactly what confused me:
eldamo.org - Eldamo : Ilkorin/Doriathrin Phonetics

From my point of view (a rather bottom-up approach) I was only seeing "rules" (and, of course, rule-start and rule-examples) but from that multiplicity it was not evident that "rules" are the main ingredients of two very different structures: on the one hand, pages such as that Ilkorin Phonetics page containing a branching tree of 'phonemes', 'phonetic rules' and 'phonetic groups'; and on the other hand, the elements contained within lexical pages or elements.
They look so similar bottom-up! :)

This is very helpful. Thanks.




Paul Strack Apr 03, 2017 (00:10)

+Lúthien Merilin Hmm. Those are good questions.

I've considered adding some sort of admin UI on top of Eldamo for data entry at various points in time, but so far have decided it's not worth the effort. It's mostly me working with the raw data, and I am familiar enough with the data model that I mostly just directly manipulate the XML.

Sometimes I work directly with the Eldamo XML file itself. Sometimes I enter data into a spreadsheet and then run it through various transforms to convert it into XML. I don't have any standard approach to this. I tend to produce custom spreadsheets and transforms based on the nature of the data I am working with. There are different optimal ways to record phonetic tables, grammatical charts and word lists, and none of them are perfect. I often hand tweak the resulting XML before adding it to Eldamo.

One general approach that I have found useful is that I've built a generic data import mechanism, which I call the "merge.xml". My normal data entry process these days tends to be:

Custom Spreadsheet >> merge.xml >> eldamo.xml

When you do work out your relational data model, some kind of bulk data import/export feature will be really useful. Often you are going to want to do mass imports or bulk edits, and an import/export feature will be vital for that.

As for additional changes to the model, there isn't much I am planning to add to the Eldamo data structures at this point in time. The model has reached the point where it is "good enough" for my purposes. I have data structures for all the things I care about: words, grammar, phonetics and conceptual development. The only thing missing that I might add someday is a more accurate dating system for references beyond my crude "early, middle, late" divisions.

As a way of verifying the Eldamo model, I have managed to "finish" several of the minor languages, which makes me feel comfortable that I have what I need for the far-distant day when I finish data-entry for Quenya and Sindarin and get serious about analysis. I could probably structure some things better, but right now I'd prefer to focus on the lengthier task of data collection.

To be honest, the most I've done recently in restructuring the model has come out of these conversations with you and +Severin Zahler. The fact that you two are trying to build systems with similar information but in a relational database structure lets me exercise the Eldamo data model in new ways, and gives me more ideas for improving it.

Lúthien Merilin Apr 03, 2017 (00:54)

ok, one more .. it seems I misunderstood the function of the l= and v= attributes in elements under Word and / or Ref like Deriv, Cognate, Inflect etcetera. It typically says with those elements something like this:

As a child of a word element, uses a word reference (@l + @v)

which I interpreted like pointing to a Form element and a Language element - not realising that they together refer to another Word entry.

This makes all the elements of that type, if they are children of a Word, into either links to another Word - varieties, and if they're children of a Ref, into - yeah, what actually?

Anyhow, it seems I've got to reconsider that "Linked" entity in my last model ;)


But something else ... while looking at this, I came across a page that confused me considerably more: I'm looking at this page: http://eldamo.org/content/words/word-3308413061.html

(eldamo.org - Eldamo : Adûnaic : adûn)

and then at the entry in the xml file:

page-id="3308413061">
(...)

... the strange thing is that there seems to be a LOT more in that HTML page than I can account for in the XML entry. How is that possible?

Is part of that page maybe compiled from information in other XML entries that refer to this entry? As if they act like HTML iframes, as it were?

I think I best call it a day .. maybe it makes sense tomorrow again :$

Lúthien Merilin Apr 03, 2017 (01:08)

+Paul Strack thanks for the elaboration! Well, should you change your mind and require some sort of entry-application, we can always add something like that.

Re. Changes of the model: since you mentioned it yourself, how would you feel about making the distinction between phonological, grammatical and lexical entries more manifest? I think it might help the model to be more understandable, but maybe the current state allows for modelling some relationships between them that would otherwise be harder?

Paul Strack Apr 03, 2017 (02:40)

This post is to answer both of your previous posts.

1) As you have noticed, there is not a direct correspondence between pages and word entry data in the model. In particular, the page usually shows relationships in more than one direction. For example Adûnakhôr has an element relationship to the word adûn.



Adûnakhôr ⇐ adûn

However, this also implies an inverse "element in" relationship between adûn and Adûnakhôr. That relationship is not directly expressed in the data model, but is shown on the page:

adûn ⇒ Adûnakhôr

The same is true of other relationships as well: derivations, changes, etc. The page rendering logic parses and shows all the relationships that entry is involved in, both as a source and as a target.
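The idea of storing a relationship once and deriving its inverse at render time can be sketched as follows. This is a hypothetical Python illustration, not the actual XQuery rendering logic; the (source, type, target) triple layout and the function name are assumptions, and only the Adûnakhôr/adûn element link comes from the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str, str]  # (source word, relationship type, target word)


def build_indexes(links: Iterable[Link]):
    """Index each stored link under its source and, inversely, under its target."""
    outgoing: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    incoming: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, rel_type, target in links:
        outgoing[source].append((rel_type, target))
        incoming[target].append((rel_type, source))
    return outgoing, incoming


# Stored once, on the compound: Adûnakhôr has adûn as an element.
links = [("Adûnakhôr", "element", "adûn")]
outgoing, incoming = build_indexes(links)

# A page for adûn shows both what it uses and what it is used in.
page_for_adun = {"uses": outgoing["adûn"], "used_in": incoming["adûn"]}
print(page_for_adun)
```

The data holds no "element in" entry for adûn, yet the inverse index still lets its page list the compound.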

2) As you seem to have surmised, yes, there is a limit to what kinds of entities can be the source and target of relationships. There are two primary entity types in the model: words and references. Words can relate to other words, and references to other references. The child elements of a reference express its relationships to other references, and the child elements of a word express its relationships to other words (with the exception of children).

There are some exceptions to this rule: the phonetic relationships are more complex, and there are isolated child elements such as an that can establish relationships to more abstract things like examples of particular grammatical inflections, but the general rule is a good starting point.

That is why, when I had this discussion with +Severin Zahler, I suggested he build his model around words, references and generic relationships as a base line. Rather than have distinct join tables for every kind of relationship, you can have a generic relationship with a "type" field expressing the relationship type: inflection, element, change, etc. That's what I would do if I were rebuilding the model in a relational DB.

As you noted, in the XML model, all the relationship keys are joint keys. A word-link uses @l and @v as its keys and a ref-link uses @v and @source. In a relational model, of course, you'd just use the ID fields, which Eldamo doesn't have.
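A minimal Python sketch of that conversion: assign surrogate IDs to words keyed by (@l, @v), then rewrite each word-link into a generic relationship row carrying a type. Everything here is an assumption for illustration, including the language code "ad", the sample words and the helper names; the sketch also keeps unresolved links aside rather than guessing a target.

```python
from typing import Dict, List, NamedTuple, Tuple

WordKey = Tuple[str, str]  # (@l, @v) joint key as used by a word-link


class Relationship(NamedTuple):
    source_id: int
    target_id: int
    rel_type: str  # e.g. "inflection", "element", "change"


def assign_ids(word_keys: List[WordKey]) -> Dict[WordKey, int]:
    """Give every (@l, @v) pair a surrogate integer ID."""
    return {key: number for number, key in enumerate(word_keys, start=1)}


def resolve_links(ids: Dict[WordKey, int], raw_links):
    """Turn joint-key links into ID-based rows; collect links that do not resolve."""
    resolved, unresolved = [], []
    for source_key, target_key, rel_type in raw_links:
        if source_key in ids and target_key in ids:
            resolved.append(Relationship(ids[source_key], ids[target_key], rel_type))
        else:
            unresolved.append((source_key, target_key, rel_type))
    return resolved, unresolved


# Hypothetical sample data; "ad" is a placeholder language code.
ids = assign_ids([("ad", "adûn"), ("ad", "Adûnakhôr")])
raw_links = [
    (("ad", "Adûnakhôr"), ("ad", "adûn"), "element"),
    (("ad", "Adûnakhôr"), ("ad", "missing-word"), "element"),
]
resolved, unresolved = resolve_links(ids, raw_links)
print(resolved)
print(unresolved)
```

Keeping the unresolved links visible makes data problems, such as a link whose target key has no matching entry, easy to spot during import.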

3) Yes, I see some value splitting out grammatical and phonetics items into a separate type of entity, because the relationships they are involved in are quite distinct from the word and reference entities. I likely won't bother in Eldamo itself, since it's too much work for too little payoff, but it might be worth doing if I were starting over from scratch.

To be honest, I still recommend ignoring grammatical and phonetic entities to start with. Modeling those correctly is hard. I went down several abandoned paths before settling on what I use today. You should expect to have several false starts before you get them right.

Lúthien Merilin Apr 03, 2017 (15:21)

Re. Rather than have distinct join tables for every kind of relationship, you can have a generic relationship with a "type" field expressing the relationship type: inflection, element, change, etc. That's what I would do if I were rebuilding the model in a relational DB

Indeed, that’s why I had the LINKED entity (specified by LINKEDTYPE) though I arrived there not so much from understanding the model, but from noting similarities between the elements in case.
My problem was that I misinterpreted lines like this one, for the Deriv element:

"Indicates this form is derived etymologically from the linked form. As a child of a word element, uses a word reference (@l + @v);"

where I understood that form (this form, linked form) meant the same thing as for the Word element:

"The @l attribute indicates the word’s language and the @v attribute the base word form"

in other words, that Deriv, Inflect, etcetera basically linked @v type “forms” together. Apparently I held on too strongly to my understanding of whatever I thought a Form was, but I think I got it right now :)

Re. To be honest, I still recommend ignoring grammatical and phonetic entities to start with.

That makes a lot of sense and would probably be the most sensible thing to do. But I just have to try, you see!
It’s also that I’m not completely certain what I can leave out, as the dividing line between lexical vs. grammatical / phonological is somewhat blurry.

Lúthien Merilin Apr 03, 2017 (21:44)

+Paul Strack, I hope this one is easy:

Of the Deriv element, it says: "As a child of a word element, uses a word reference (@l + @v); as a child of a ref element, uses a source reference (@source)."

And sure, I tried matching a few Word->Deriv elements to other Word instances based on their shared (@l + @v) and that seems to work great.

Of course, the Ref-Deriv elements also have a @v attribute, though that is, according to that documentation paragraph, not used as a reference; the @source attribute is used for that.

Nonetheless, in the examples that I tried, the @v attribute of Ref-Deriv elements is also referring to another element. For instance, take http://eldamo.org/content/words/word-603669853.html

The corresponding xml snippet is:





where the Deriv element is rendered on the Eldamo page as the first entry under Derivations. The "khabnā" @v of that Deriv element is linked to http://eldamo.org/content/words/word-2495185209.html

I was wondering where that URL comes from? After all, in the xml file, the Deriv element only contains the @v attribute; it doesn't have the @l part that would uniquely identify a Word entry.

I tried to figure out if it maybe uses the l="q" of the parent Word; but there is no Word entry with l="q" v="khabnā". But there is this one:

page-id="2495185209">

indeed, the one that that Deriv element's URL links to.

But how is that Word found, while the language - half of its identifier - is not known?

EDIT - it just occurred to me that it might be found through the language hierarchy, because Quenya->Ancient Quenya->Primitive Elvish ...?
eldamo.org - Eldamo : Quenya : hamna

Paul Strack Apr 03, 2017 (22:29)

Erg, yes, your guess is correct. There is some old matching logic in the lookup that tries to guess the @l when it is missing. I thought I'd gotten rid of all that and made things stricter, but it looks like I missed a few cases.

I will fix the XSD and see if I can clean up the data for the next release.
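
The fallback described above can be pictured with a minimal Python sketch. It is not Eldamo's actual lookup code: the language ids "aq" and "p", the `PARENT` table, and the assumption that khabnā is filed under the primitive language (and hamna under page-id 603669853) are illustrative guesses based on the question above. Only l="q" and the two page-ids appear in the thread.

```python
from typing import Optional

# Hypothetical parent links between language ids. Only "q" comes from the
# thread; "aq" and "p" stand in for Ancient Quenya and Primitive Elvish.
PARENT = {"q": "aq", "aq": "p", "p": None}

# Word index keyed by (@l, @v); the values are page-ids (assumed mapping).
WORDS = {
    ("q", "hamna"): "603669853",
    ("p", "khabnā"): "2495185209",
}


def resolve(v: str, l: Optional[str], context_l: str) -> Optional[str]:
    """Return the page-id for a reference, guessing @l when it is missing."""
    if l is not None:
        # Strict lookup: both halves of the key were supplied.
        return WORDS.get((l, v))
    # Lenient lookup: start at the containing word's language and walk up
    # the hierarchy until some ancestor language has a word with this @v.
    lang: Optional[str] = context_l
    while lang is not None:
        if (lang, v) in WORDS:
            return WORDS[(lang, v)]
        lang = PARENT.get(lang)
    return None
```

With these assumptions, `resolve("khabnā", None, "q")` finds the entry two levels up, while the strict form `resolve("khabnā", "q", "q")` finds nothing, which is the behaviour a stricter XSD would enforce.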

Lúthien Merilin Apr 03, 2017 (23:41)

And here's the 10th draft of the model!

In this one I corrected the error in the previous version where the Word & Ref child elements linked rather to Forms than to Words.
I was thinking of temporarily storing the @v + @l attributes in the LINKED entity instead of trying to lookup the Entry (Word) id right away. It will be straightforward to retrieve the Entry id later on, based on those two columns.
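
A rough sketch of that two-step approach, using Python's built-in `sqlite3` module. The table names ENTRY and LINKED come from the model; the column names (`l`, `v`, `entry_id`) and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry  (id INTEGER PRIMARY KEY, l TEXT, v TEXT, UNIQUE (l, v));
    CREATE TABLE linked (id INTEGER PRIMARY KEY, l TEXT, v TEXT, entry_id INTEGER);
""")

# Pass 1: store the raw (@l, @v) pair while parsing. The target entry may
# not have been seen yet, so entry_id stays NULL for now.
conn.execute("INSERT INTO linked (l, v) VALUES ('p', 'khabnā')")
conn.executemany(
    "INSERT INTO entry (l, v) VALUES (?, ?)",
    [("q", "hamna"), ("p", "khabnā")],
)

# Pass 2: once every entry exists, fill in the id from the two columns.
conn.execute("""
    UPDATE linked
       SET entry_id = (SELECT e.id FROM entry e
                        WHERE e.l = linked.l AND e.v = linked.v)
""")

resolved = conn.execute("SELECT l, v, entry_id FROM linked").fetchall()
```

Rows whose pair matches no entry simply keep a NULL `entry_id`, which makes dangling references easy to list afterwards.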

This way I can also get rid of the BEFORE table, and I ironed a few other wrinkles out as well.


https://plus.google.com/photos/...

Lúthien Merilin Apr 06, 2017 (21:12)

I found a small inconsistency in the data: in about a dozen places there is a reference to a language with ID "sol":








(corresponding to page http://eldamo.org/content/words/word-573915711.html)

while there is no language entry with that id. Maybe they are meant to refer to this one?


eldamo.org - Eldamo : Middle Telerin : Elu

Paul Strack Apr 07, 2017 (06:10)

That's not an error, that's deliberate. There are some languages that are small enough in attested words that they are easier to handle in groups with other languages. Thus, I treat Solosimpe and Early Telerin as one language instead of two, since there are not enough Solosimpe words to treat separately. Similarly, I treat Ilkorin and Doriathrin together.

In those cases, there is an @l attribute on the ref elements attested as members of the secondary language. The language elements only list the primary languages, and link only to word elements.

Lúthien Merilin Apr 07, 2017 (11:40)

Well, I understand the reason for putting the languages in one group ... what I mean is: how then can you connect l="sol" in a ref element to "Solosimpi/Early Telerin" when the ID of "Solosimpi/Early Telerin" is "et" and not "sol"?

In other words: how does the code know that "sol" is in the same group as "et"?

(Edit)
Hmm is it maybe that in those cases I should not consider l="sol" to be an identifier that's supposed to match with something, but merely as a label?

If so, is there any "formal" way to decide when an attribute is to be understood as 'label only', so that it needn't be matched with something defined elsewhere?

Paul Strack Apr 07, 2017 (15:42)

It is because the sol refs are inside an et word. The @l attribute on a ref is, as you say, a "label only".

This isn't the only reason why a ref might have a different @l attribute than the word containing it. There are a couple cases where a Quenya word was mislabeled Noldorin or vice versa. In those cases, the ref has an @l="n" even though the word containing it is @l="mq".

Put another way, the vast majority of refs don't have an @l attribute and are assumed to be the same language as the word containing them. The few refs that have an @l attribute have a different language label in Tolkien's writing, but are classified under a different language by Eldamo for various reasons.
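
The rule can be sketched in a few lines of Python. The sample XML is invented for illustration (the root element name, the @v value, and the @source values are placeholders, not real Eldamo data); only the `word` and `ref` elements and the `l` attribute are taken from the discussion.

```python
import xml.etree.ElementTree as ET

# Invented sample: one "et" word with an unlabelled ref and a "sol" ref.
SAMPLE = """
<words>
  <word l="et" v="example">
    <ref v="example" source="SRC-1"/>
    <ref l="sol" v="example" source="SRC-2"/>
  </word>
</words>
"""


def ref_languages(xml_text):
    """Return (filed_under, label) for every ref, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for word in root.iter("word"):
        word_lang = word.get("l")
        for ref in word.iter("ref"):
            # A ref's own @l is only a label; without one, the ref is
            # assumed to carry the language of the containing word.
            label = ref.get("l") or word_lang
            result.append((word_lang, label))
    return result
```

Here both refs are filed under "et", and only the second one keeps "sol" as its label.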

Lúthien Merilin Apr 07, 2017 (17:04)

This is probably the final data model. Some other changes that came out of test-parsing eldamo.xml:

- merged ENTRY_TYPE, DOC_TYPE and SOURCE_TYPE into a generic TYPE entity (uses parent-id to categorise the rows);
- merged RULE-EXAMPLE and DERIV-RULESEQUENCE into RULESEQUENCE (this includes the original rule-start element);
- took the two GRAMMARTYPE columns out of LINKED and added a n:m LINKED_GRAMMAR intersection entity
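
As an illustration of the first point, a generic TYPE table with a self-referencing parent id might look like the sketch below. The column names and the two member rows ('word', 'book') are hypothetical; only the idea of categorising rows through parent-id comes from the list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE type (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES type (id),  -- NULL for category rows
        name      TEXT NOT NULL
    );
    INSERT INTO type (id, parent_id, name) VALUES
        (1, NULL, 'entry-type'),
        (2, NULL, 'doc-type'),
        (3, NULL, 'source-type'),
        (4, 1, 'word'),   -- hypothetical member of entry-type
        (5, 3, 'book');   -- hypothetical member of source-type
""")


def members(category):
    """List the type names whose parent row is the given category."""
    rows = conn.execute(
        "SELECT c.name FROM type c JOIN type p ON c.parent_id = p.id "
        "WHERE p.name = ? ORDER BY c.id",
        (category,),
    )
    return [name for (name,) in rows]
```

A call such as `members("entry-type")` then returns only the rows that used to live in the separate ENTRY_TYPE table.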

The only thing left to test is the rule-related stuff; if that works OK, I can parse the whole set and insert it in the database.


https://plus.google.com/photos/...

Lúthien Merilin Apr 09, 2017 (11:29)

Right! I finished parsing the data this morning. Some very minor data model edits came up along the way, so here's another 'final' version.

The database DDL + data insert script is way too large to attach; I'll create a Github repo for all that asap. But if anyone wants to have a look at it, I can post it on wetransfer.com

https://plus.google.com/photos/...

Lúthien Merilin Apr 09, 2017 (11:35)

+Paul Strack - especially now that I see that the core tables have around 70.000 rows, I am all the more impressed by how much effort you put into creating Eldamo. It's not just the quality, but the sheer size borders on heroic. My deepest respect!

Paul Strack Apr 09, 2017 (18:29)

+Lúthien Merilin Thank you. I have been working on the damn thing for eight and a half years now, and I'm still not even done with the data entry :)

I would like to see the DDL and the script, but I can wait until you get Github set up. I don't have time to look it over this weekend.

Lúthien Merilin Apr 10, 2017 (10:37)

Eight and a half years ... well, that's indeed about the order of magnitude those row counts made me think of. Impressive! Very few manage such a sustained effort.


Re. Those l attributes: Thanks, understood. I made it so that refs without a language attribute take the one from the parent entry, because the language ID column was needed there anyhow for the exceptions.

I found nine other unlinked l= attributes in the Example element:
bel
dor
dor ilk
edan
eon
fal
ln
lon
oss

While some of those are pretty obvious, I could not establish from the data itself what they mean, nor a parent language for any of them.

Like with sol, I made the names of those languages equal to their shortcode (mnemonic) so that they will be displayed as such.
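
That placeholder step can be sketched as follows. The `defined` dictionary is a tiny assumed excerpt of the language definitions and `used` is an assumed sample of @l values met while parsing; neither reflects the full data set.

```python
def add_placeholders(defined, used):
    """Return a copy of `defined` with a placeholder for every unknown code.

    A placeholder's display name is simply the code itself, so every @l
    value has a language row to point at.
    """
    result = dict(defined)
    for code in used:
        result.setdefault(code, code)
    return result


# Assumed excerpt of the defined languages, keyed by shortcode.
defined = {"et": "Solosimpi/Early Telerin", "q": "Quenya"}
# Assumed sample of @l values found on ref and example elements.
used = ["q", "sol", "et", "dor ilk", "sol"]

languages = add_placeholders(defined, used)
```

Existing names are left untouched, and a combined value such as "dor ilk" is treated as a single code here because that is how it appears in the list above.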

Btw, just to make sure: I only brought this up because I approach attributes like that as a kind of foreign key, which is how I could extract the relational entities. They have to match with something.
It's not that I thought this was an issue with the data itself. It's that relational databases are very inflexible, they need everything to match to something.