Post UzVZRJx1mcp

Severin Zahler Jun 27, 2017 (14:42)

+Paul Strack

eldamo-data review

All queries were run against the data of eldamo v0.5.5, but as you have probably not changed any of the existing data, these findings should still apply.
As promised, here is a pretty thorough analysis of the topic "marks on glosses":

Question marks [ ? ] in glosses:
There's a total of 301 glosses with one or more question marks within them.

The largest group, and thus I'd say the "most correct and consistent" layout is where the question mark appears in leading position; this affects 209 of these glosses.
10 of those are nothing but a question mark.
1 is most likely a typo; it has a space between the question mark and the gloss:

Next up there's of course some with a question mark in trailing position. Most are "correct" in the sense that the question mark is there because the gloss belongs to a sentence which is a question. It's noteworthy though that there are many conlang words that miss the "?" while the gloss has one.

However, some of the glosses with trailing question marks use them to mark uncertainty of the gloss. It'd be nice if you could change those so that the mark is in leading position instead, as it is not possible to distinguish these two cases reliably in a program:

Now left are 56 entries that have the question mark neither in trailing nor leading position.
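
For illustration, a minimal Python sketch of how glosses could be sorted into the position groups described above. This is not the query that produced the counts; the function name and most sample glosses are made up for the example, only "? Water" and "foaming (?fall)" occur in this thread.

```python
def question_mark_position(gloss):
    """Classify where a '?' sits in a gloss.

    Returns one of: "none", "only", "leading", "trailing", "inner".
    "only" (a bare "?") is a special case of the leading group.
    """
    text = gloss.strip()
    if "?" not in text:
        return "none"
    if text == "?":
        return "only"
    if text.startswith("?"):
        return "leading"
    if text.endswith("?"):
        return "trailing"
    return "inner"


# Hypothetical sample data, counted per group.
samples = ["?water", "?", "? Water", "who are you?", "foaming (?fall)", "star"]
counts = {}
for gloss in samples:
    kind = question_mark_position(gloss)
    counts[kind] = counts.get(kind, 0) + 1
```

Note that a check like this cannot tell a genuine question from an uncertain gloss in the trailing group; that still needs a human or extra information such as the part of speech.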

First off there's a small group of glosses which have a leading "?", but all or parts of the gloss, including the question mark, are enclosed in brackets. There's 16 of those:

Brackets may surely make sense to signify that parts of the gloss are uncertain, but for those that are completely contained within brackets it seems a bit redundant to have both brackets and question mark.

Then there's three entries that also have the question mark in leading position, but are preceded by "(lit.)": page-id's 1766902045, 98311109, 1622754763

The remaining 37 are of various nature, but very often the pattern "(?uncertain part of gloss)" appears, so this is also fairly consistent. I am not sure whether the occasional square brackets are intentionally different from round brackets, but I believe there could be some more unification on these entries:

Are you interested in fixing such technical things? I'll continue with other marks on glosses for the rest of today, but tell me whether I should continue posting these or whether I should just sort them out on my end.
Eldamo : Sindarin : Nenning

Severin Zahler Jun 27, 2017 (14:54)

Interrobangs [ ‽ ] in glosses ("marked by author with '?'")

These could probably follow the same logic as the regular question marks (in leading position if applied to entire gloss, else within round brackets with uncertain gloss part).

Severin Zahler Jun 27, 2017 (15:02)

Archaic mark [ † ] in glosses

47 Entries have this mark and all but 3 have it consistently in leading position.

Exceptions: - Eldamo : Gnomish : gedweth

Severin Zahler Jun 27, 2017 (15:16)

The marks [ ** ], [ ! ], [ ^ ], [ # ] and [ | ] do not appear in glosses or are never used as marks.

Plausible neologism marks in glosses [ * ]

There's a whopping 1187 glosses with this mark, however with a very high consistency:

- 1125 have the mark in leading position.
- 57 have the mark in leading position, but with (lit.) preceding.
- 4 have it in trailing position: Id's 2372957001, 3229989959, 3684301635, 3027919603
- 1 has it within: - Eldamo : Quenya : tengwië

Severin Zahler Jun 27, 2017 (16:26)


While looking through the gloss marks I noticed that some verbs are glossed with "to verb" and some without. So my next checkups are on verbs!

There are 4319 glosses connected to verbs. 1905 of those begin with "to ", so it is probably quite a lot of work to fix this inconsistency. I have to note, though, that I have split the glosses. So what you may have as e.g. "to fly, soar" in the XML will yield "to fly" and "soar" on my end. So it may rather be something I need to fix on my end. If you'd nevertheless like me to get you the full list of the verbs either with or without "to", please tell me!
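
A minimal Python sketch of one way the splitting could carry the "to" over to the later parts, so that "to fly, soar" does not end up as "to fly" and a bare "soar". This is an assumption about a possible fix, not the existing Java code; the function name is hypothetical.

```python
from typing import List


def split_verb_gloss(gloss: str) -> List[str]:
    """Split a verb gloss on commas and propagate a leading "to ".

    If the first part starts with "to ", every later part that lacks
    it gets the prefix as well; otherwise the parts are left alone.
    """
    parts = [p.strip() for p in gloss.split(",") if p.strip()]
    if parts and parts[0].startswith("to "):
        parts = [p if p.startswith("to ") else "to " + p for p in parts]
    return parts


result = split_verb_gloss("to fly, soar")
```

Glosses that never had a "to" pass through unchanged, so they can still be counted as inconsistent afterwards.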

But there's more I can say (note however that these are mostly from looking at the data and not searching for certain things, so no guarantee I caught all of these kinds)

Words that are marked as verbs but don't seem to be: - Eldamo : Early Noldorin : or-

These ones seem to have been fixed by now, as the online version differs from the 0.5.5 XML, where they're verbs:

Glosses contained within single quotes
It seems to be intentional, but I would not be able to guess the reason:

Glosses that have ", etc."
Not sure what the idea behind such glosses is (could you explain?). I end up having glosses that just spell "etc." :P

These ones have a latin word or abbreviation as gloss :O

Minor problem with ", if":
One of the inflections contains ", if" in the sense of a conditional clause, however this comma is being mistaken as gloss separator by my program ;)

Minor gloss derp: "or" instead of comma:

Wherever there was no gloss given I tried fetching one from child elements, such as elements. That may not have been a good idea, as it brought me quite a few unwanted glosses that are not in the infinitive. However, I'll see whether I can generate a list of all words that lack a gloss but have a gloss on a child element, so that you could enter a reconstructed gloss, e.g. on this word:
But for that I got to go into the Java code again, so that may still take a little while until I get to it.

Sorry for the spam guys, but I hope it's for the best of eldamo!

Severin Zahler Jun 27, 2017 (16:56)

Next (there's still lots I can check :S) I investigated on the use of the different bracket types in glosses:

Square brackets [ ] in glosses (count: 246)
The usage of these seems to be rather inconsistent. What I see often is situations where in other words round brackets are used, and then the situation where the square brackets mark parts that are not explicitly to be found in the conlang text.

I'll point out some exemplary words which I think are strange, but I won't identify every single one, unless +Paul Strack really is interested in making the use of brackets as consistent as possible. The current use of brackets does not cause issues for understanding things, it just makes programmatic analysis / sorting / searching quite a bit trickier. I apologize if it turns out that I am completely mistaken and all of these brackets are already used in a very distinct way!

Situations where I'd use square brackets:
- Explanatory glosses, i.e. when you have no actual gloss but can give a hint what it means. Examples:

- Partial glosses which are not appearing as part of the conlang word:

Examples of where I'd use round brackets instead:
- Uncertain partial glosses:

- Concretions

Where I'd drop the brackets altogether because the gloss part is also part of the conlang word:

Then I am just generally confused by this: - Eldamo : Old Noldorin : nurra

So, and that's all for today! More to come, given Paul wants to look into these sorts of things ;)

Paul Strack Jun 28, 2017 (07:53)

That's a lot to look at. I will go through it this weekend and get back to you.

And yes, I am interested in this kind of cleanup/analysis.

Severin Zahler Jun 28, 2017 (08:48)

Alright, thanks for the heads up, so I will continue with this for a while then ;)
Here's the previous one about square brackets, I exported it as XML now and not as CSV as it is imo easier to look through than a CSV (tell me if you'd prefer another format, phpMyAdmin can do many). - [XML] eldamo glosses with square brackets XML -

Then the curly brackets in glosses:
There's only four, in some cases other brackets may be more fitting, ie. where the brackets do not indicate a change

I also checked whether all brackets are balanced (i.e. whether there's any gloss that has an opening but no closing round/square bracket or vice versa), and I could not find anything suspicious.
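
A minimal Python sketch of such a balance check, using a stack. It is stricter than simply counting opening and closing brackets because it also rejects wrong nesting such as "(a]"; whether the original check did that is not stated, and including curly brackets is an assumption as well.

```python
# Maps each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def brackets_balanced(gloss):
    """Return True if every bracket in the gloss is properly closed."""
    stack = []
    for ch in gloss:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            # A closing bracket must match the most recent opening one.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack
```

Running this over all glosses and listing those where it returns False would give the "suspicious" entries, if there were any.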

Paul Strack Jul 02, 2017 (05:02)

OK, I finally had time to look through this. You've done a lot of good analysis here.

Unfortunately, I think you are trying to extract a level of meaning from the source data that simply isn't justifiable. At this point in time, the @gloss attributes simply aren't intended to be interpreted by anything other than a human reader.

There are a lot of inconsistencies in the formats of the gloss data, for a variety of reasons.

1) The ref/@gloss attributes are intended to be a relatively faithful reproduction of the original gloss from Tolkien. In some cases I have altered the punctuation for clarity, but mostly I tried to record what Tolkien wrote. These glosses cannot be interpreted as anything other than atomic units.

2) The word/@gloss attribute may eventually be machine parseable, but not in the current (late alpha) stage of the data model. I've only done a full editing pass on a handful of the smaller languages. The larger languages, particularly Quenya and Sindarin, have so far had only light and preliminary editing passes.

I have to balance my time between making the current, alpha model useful and pushing forward on completing the data entry. Some things like gloss consistency are just going to have to wait. Cleaning up the glosses in the data model is probably going to be one of the last things I do in the model in the late beta stage of its development. It is the sort of thing I want to do only after all the data entry is done and I have a firm grasp of the conceptual, phonetic and grammatical developments of the languages.

As you push forward in your own analysis, you are going to reach a point where the data in Eldamo is insufficient for your purposes. At that point, you are going to have to go beyond Eldamo, engage directly with the source material and do your own analysis. You may be reaching that point at least in some areas.

And I do hope that you go beyond the Eldamo data set. Having other, alternate Elvish data models out there will make cross-model analysis possible, which can greatly improve the quality of all the data models. I got a lot of value out of comparing my data set to those of David Giraudeau and the older Elvish dictionaries.

Anyway, I will try to respond to some of your specific comments above.

Paul Strack Jul 02, 2017 (05:14)

Brackets, Braces and Parentheses

In general, brackets [ ] in glosses represent editorial additions, items added to the gloss that are not present in the original. I use those when making partial additions to glosses, as opposed to marking complete unattested glosses with a “*”.

Braces { } in glosses represent modifications or deletions, following the conventions in Parma Eldalamberon. For example:

“now supposing you asked me, a thing unlikely {or ridiculous} to suppose”

This indicates that “{or ridiculous}” was present in the original text but deleted by Tolkien.

Parentheses appear for a variety of reasons. Sometimes they represent parentheses in the original. Sometimes they represent brackets in the original, which I changed to parentheses to make them distinct from editorial additions. Sometimes I added parentheses of my own for clarity.

One particular use of parentheses is for speculative glosses for illegible words in the original text. For example: - Eldamo : Ilkorin/Doriathrin : espalass

espalass “foaming (?fall)”

Here the second word in the original is hard to make out, so the gloss is speculative (generally from the editors of the original material rather than my own speculations).

Paul Strack Jul 02, 2017 (05:22)

Glosses with quotes, non-English words and strange formats

These are mostly copied from Tolkien's writing. For example:

hug- “futuere, *to copulate”

Tolkien's original gloss was not English. The English translation is unattested (usually, but not always, as provided by the editors of the original article).

Similarly, quotes are copied from the original, though in some cases I changed double quotes to single quotes: - Eldamo : Early Quenya : tyustyukta-

chew the cud; reflect, ‘reminisce’ (quotes from Tolkien).

Similar for trailing sentences or lists ending in “etc.”

Paul Strack Jul 02, 2017 (05:27)

Verb Glosses

Verb glosses in ref elements are copied from Tolkien and may or may not have a leading “to”. I eventually intend verb glosses in word elements to always use the English infinitive form “to verb”, but since I haven't finished editing, these glosses are not yet consistent.

Sometimes I use “;” to separate conceptually distinct groups of glosses. In those cases, I intend to put another “to” at the beginning of the new group, in the same way that I intend to put a dash after the verb to indicate that it is a stem form.

naitya- “to damage, hurt; to put to shame, abuse” - Eldamo : Early Quenya : naitya-

Paul Strack Jul 02, 2017 (05:38)

Odd placement of marks like “†”, “?” and “‽”

Some of those are copied from Tolkien’s original gloss. Some of them are stylistic on my part (I think the abnormal placement is clearer when a human reads it). Some of them are simply errors that I will get around to cleaning up at some point.

The gloss for Nenning, for example is deliberate:

Nenning “? Water”

The intent is that the gloss is probably “[Something] Water”, but what that “[Something]” might be is unclear. Maybe it should be “Water ?” instead, but it isn't a typo, given that the adjectival element usually follows in Sindarin.

I think that covers the majority of your questions. - Eldamo : Sindarin : Nenning

Paul Strack Jul 02, 2017 (05:43)

One more, nurra “[ǝ]”: - Eldamo : Old Noldorin : nurra

That one is confusing, and I had to look it up in the source material. It is the name of an archaic tengwa, representing the schwa sound [ǝ]. That's the only gloss provided.

Severin Zahler Jul 05, 2017 (08:22)

Thank you a ton for your work! Don't worry, I won't need to rely on you "fixing" or normalising all these things. I want to achieve two things by posting these findings: first, to point out things that may indeed be wrong and that you'd like to fix; second, to find out which things I should sort out on my end, which ones I may expect you to change, and which cases I should leave as they are.

I am now starting to implement a new bunch of things and then send your XML through my code once more. Things I'll change in the next iteration so far include:
- Inclusion of language relations (so that my search form can display all the languages somewhat like you have it)
- Verb glosses: The code will check whether it has a "to" in front, if not it will be added
- I will split the marks off word glosses, but as you suggest will leave the ref glosses untouched. I still got to see which cases are common and consistent enough to treat them programmatically (I am also considering having the program ask for cases that are unclear, so far no user input is required for the program)
- Where the gloss was missing I tried inheriting a gloss from a child element; I will need to revise the situation where the gloss of a child gets inherited, as I have a few words that have a gloss that's not in the base form.
- If glosses from word or ref type "phrase" or "text" contain commas or semicolons, the gloss will not be split.
- Maybe more.... - Eldamo : Language Index

Severin Zahler Jul 05, 2017 (14:14)

Btw: I forgot to mention, the queries I did were all only on -elements, none were on references and neither the glosses of references were queried.

Severin Zahler Jul 05, 2017 (16:42)

Well, looks like I fooled myself a bit: I tried fixing the error that I am inheriting glosses from elements to be used if elements are missing glosses. Turns out I never actually did that...

So instead I'd consider this somewhat a derp, in the XML you have a "tekta-nt", glossed "drew, wrote". It contains an element which gives "tekta" as base form. tekta-nt is contained in a "tekta", but it is glossed "drew, wrote", but logically it should be glossed "*draw, *write":

Paul Strack Jul 08, 2017 (06:24)

+Severin Zahler You're right, that is a bug. I already have code that does what you describe above of using the ref gloss if the word gloss hasn't been defined yet. It looks like there are errors in that logic. I will fix this in the next release.

Severin Zahler Jul 10, 2017 (08:36)

Okay so just fyi, here's which cases I distinguished on marked glosses:

- "?Gloss" --> "Gloss" (except "? Gloss")
- "(?Gloss)" --> "Gloss"
- "Glosspart (?Glosspart)" remains unchanged
- "Gloss?" --> "Gloss", IF speech is != "phrase", else remains unchanged
- "(lit.) ?Gloss" --> "(lit.) Gloss"

- "‽Gloss" --> "Gloss" (except "‽ Gloss")
- "Gloss‽" --> "Gloss" (always)

- "†Gloss" --> "Gloss"
- "Gloss†" --> "Gloss"

- "*Gloss" --> "Gloss"
- "(lit.) *Gloss" --> "(lit.) Gloss"
- "Gloss*" --> "Gloss"
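
The cases above could be sketched in Python roughly like this. It is an illustration, not the actual program: the function name and the `speech` parameter are hypothetical, and treating a spaced "† Gloss" or "* Gloss" like "? Gloss" (left unchanged) as well as stripping a trailing "*" are assumptions.

```python
MARKS = "?‽†*"
LIT = "(lit.) "


def split_marks(gloss, speech=""):
    """Return (gloss without marks, list of marks that were split off)."""
    prefix, text, marks = "", gloss, []
    if text.startswith(LIT):
        prefix, text = LIT, text[len(LIT):]

    # "(?Gloss)": the whole gloss sits in one pair of round brackets.
    if (len(text) > 3 and text.startswith("(?") and text.endswith(")")
            and text.count("(") == 1 and text[2] != " "):
        marks.append("?")
        text = text[2:-1]

    # Leading mark, but not the spaced form such as "? Gloss".
    if len(text) > 1 and text[0] in MARKS and text[1] != " ":
        marks.append(text[0])
        text = text[1:]

    # Trailing mark; a trailing "?" on a phrase is kept as punctuation.
    if len(text) > 1 and text[-1] in MARKS:
        if not (text[-1] == "?" and speech == "phrase"):
            marks.append(text[-1])
            text = text[:-1]

    return prefix + text, marks
```

For example, `split_marks("foaming (?fall)")` leaves the gloss untouched, because the question mark only covers part of it.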

All extracted marks are stored separately (translations are stored as entries in the table "has_relation", which contains the IDs of the Elvish and the English word, and the marks are stored in the table "relation_has_mark", which contains a relation-ID and a mark-ID).
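
A minimal sketch of that table layout, using Python's built-in sqlite3 so it runs on its own. Only the two table names come from the description above; the column names, the in-memory database and the sample IDs are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE has_relation (
    id              INTEGER PRIMARY KEY,
    elvish_word_id  INTEGER NOT NULL,   -- hypothetical column name
    english_word_id INTEGER NOT NULL    -- hypothetical column name
);
CREATE TABLE relation_has_mark (
    relation_id INTEGER NOT NULL REFERENCES has_relation(id),
    mark_id     INTEGER NOT NULL,
    PRIMARY KEY (relation_id, mark_id)
);
""")

# One translation (Elvish word 1 -> English word 2) carrying mark 7.
cur = conn.execute(
    "INSERT INTO has_relation (elvish_word_id, english_word_id) VALUES (?, ?)",
    (1, 2),
)
relation_id = cur.lastrowid
conn.execute(
    "INSERT INTO relation_has_mark (relation_id, mark_id) VALUES (?, ?)",
    (relation_id, 7),
)

# Read the translation back together with its mark.
rows = conn.execute(
    """SELECT r.elvish_word_id, r.english_word_id, m.mark_id
       FROM has_relation r
       JOIN relation_has_mark m ON m.relation_id = r.id"""
).fetchall()
```

Keeping the marks in their own table means a translation can carry several marks, or none, without changing the gloss text itself.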

This yields the following list of glosses (on conlang words, ref's are untouched), that still contain a mark:

Especially noteworthy: I have now added the rule that "phrase" and "text" glosses are no longer split on commas. While this ensures that I do not store sentence fragments, there seem to be some sentences with multiple comma-separated glosses, which I cannot tell apart from a single sentence containing commas. Thus e.g. the gloss "in the flowing sea, (lit.) *in sea-streams" keeps the mark, whereas it would have been split off if the two glosses had been separated. I'll keep it like this though, as my form won't be intended for searching phrases anyway and I prefer this solution over gloss fragments.
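
A minimal Python sketch of the splitting rule just described. The function name and the speech value "vb" are hypothetical; only "phrase" and "text" are taken from the rule above.

```python
import re

# Speech types whose glosses are kept whole.
UNSPLIT_SPEECH = {"phrase", "text"}


def split_gloss(gloss, speech):
    """Split a gloss on commas and semicolons unless it is a phrase or text."""
    if speech in UNSPLIT_SPEECH:
        return [gloss.strip()]
    return [p.strip() for p in re.split(r"[,;]", gloss) if p.strip()]


phrase_parts = split_gloss("in the flowing sea, (lit.) *in sea-streams", "phrase")
verb_parts = split_gloss("to damage, hurt; to put to shame, abuse", "vb")
```

The phrase comes back as a single gloss, which is why its "*" is still attached afterwards, while the verb gloss yields four separate parts.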

Then as you see in the picture my program marks the source "PEE/17.35" as not resolvable. I guess I should add this to the exceptional source formats where I already have types "MS/01.04" and "Plotz/11.2".

Furthermore I am now checking all verb-glosses whether they start with "to ", and if not I add that. So far I have not added it to 's glosses.

Also I now added the relations to the constructed languages.

With that I have worked off all the changes I wanted to do and I'll be porting the data to the database again today (gonna take a while...) So probably tomorrow I'll make some more queries to check on the consistency of the data and maybe point out some more things that may be of interest for you.