infpTCicoFy — G+ LoME Archive

Severin Zahler Jan 10, 2017 (10:37)

So, after the Christmas break I am back working on my conlang database project. Currently I am developing a Java program which dissects the data from Eldamo and sorts and arranges it into the tables, and ultimately will load it directly into the database. As an intermediate development / debugging step I am writing the tables to CSV files for easy verification of the code.

+Paul Strack As I am doing this I am stumbling across some of the (so far) very few errors in the Eldamo file. I just extracted the "speech" attribute of all word pages (in order to create a word type table), and there are a few things I found odd / noteworthy:

- There's a few combined types which appear in two different orders, I presume these are not intentionally different from each other:
-- "n adv" vs. "adv n"
-- "n adj" vs. "adj n"
-- "conj adv" vs. "adv conj"
-- "adv prep" vs. "prep adv"
-- "adv interj" vs. "interj adv"
- "card" appears once; in all other occurrences you put "cardinal": http://eldamo.org/content/words/word-4008647877.html
- Similar for "suf" vs. "suffix", here the latter one appears once only: http://eldamo.org/content/words/word-3792660851.html
- Similar for "pref" vs. "prefix", prefix is used 4 times in total.
- There's one occurrence of "v" instead of "vb": http://eldamo.org/content/words/word-529403071.html
- For G. siriol "flowing" some major mix up of the fields has happened; "sindi" appears as word and "siriol" as word type: http://eldamo.org/content/words/word-2199144701.html
- There are two words which are missing the speech attribute, even though the XSD file states there should always be one:
-- http://eldamo.org/content/words/word-4199786153.html
-- http://eldamo.org/content/words/word-2768164367.html
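
The order variants listed above ("n adv" vs. "adv n" and so on) can be found mechanically. What follows is only a minimal sketch of one way to do it in Java, not the code of the actual program: the class and method names are hypothetical, and it assumes the speech values have already been extracted into a list of strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of Eldamo or of the program discussed in this thread.
class SpeechOrderCheck {

    // Canonical form: split on whitespace and sort the parts,
    // so that "n adv" and "adv n" map to the same key.
    static String canonical(String speech) {
        String[] parts = speech.trim().split("\\s+");
        Arrays.sort(parts);
        return String.join(" ", parts);
    }

    // Groups the raw values by canonical form and keeps only
    // those groups that contain more than one distinct spelling.
    static Map<String, Set<String>> orderVariants(List<String> speechValues) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String value : speechValues) {
            groups.computeIfAbsent(canonical(value), k -> new TreeSet<>()).add(value);
        }
        groups.values().removeIf(variants -> variants.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        // Sample values taken from the list above.
        List<String> sample = List.of("n adv", "adv n", "n adj", "adj n", "vb", "n");
        orderVariants(sample).forEach(
                (key, variants) -> System.out.println(key + " <- " + variants));
    }
}
```

Abbreviation pairs such as "suf" vs. "suffix" are not caught by this; those would need an explicit mapping table.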

I hope this helps to fix some things :) For your convenience, here's the entire output I got from my program: http://pastebin.com/5kSJpf6j

I will also try to add a few additional checks to my code which may help to find a few more of these minor flaws :)

Eldamo : Early Quenya : atwen

ᴱQ. atwen card. “20”. Reference ✧ QL/033.2401 ✧ “20”. Derivations. < ᴱ√ATA¹ “dual” ✧ QL/033.2201.

Severin Zahler Jan 11, 2017 (14:09)

+Paul Strack there are multiple language codes used in the ref tags which are not defined in a language tag.

These are:
- "bel"
- "dor"
- "fal"
- "edan"
- "sol"
- "ln"
- "eon"
- "lon"
- "oss"

(All language codes used in the word tags exist though)

Severin Zahler Jan 11, 2017 (16:25)

Next up, there are a few sources with prefixes that are not mentioned in a source tag:

- LT/192.3006, duplicate of another reference which differs only in having a valid source (LT1/192.3006)
- LT/132.0212, this one has no valid duplicate reference though.
- VT32/07.1106: Unique case, don't have VT on me to check whether the source is erroneous or whether VT32 is just missing in the sources list.
- PEE/17.35, unique case
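
A check like this boils down to a set difference between the prefixes that occur in the references and the prefixes that are declared. The following is a rough sketch only, with hypothetical class and method names; the set of known prefixes in `main` is a small illustrative subset, not the real sources list.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of Eldamo or of the program discussed in this thread.
class SourcePrefixCheck {

    // Everything before the first '/' is treated as the source prefix.
    static String prefixOf(String reference) {
        int slash = reference.indexOf('/');
        return slash < 0 ? reference : reference.substring(0, slash);
    }

    // Returns the prefixes that occur in the references but are not declared.
    static Set<String> unknownPrefixes(List<String> references, Set<String> knownPrefixes) {
        Set<String> unknown = new TreeSet<>();
        for (String reference : references) {
            String prefix = prefixOf(reference);
            if (!knownPrefixes.contains(prefix)) {
                unknown.add(prefix);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        // Illustrative subset only; the real list would be read from the data file.
        Set<String> known = Set.of("LT1", "QL", "PE13");
        List<String> references =
                List.of("LT1/192.3006", "LT/192.3006", "PEE/17.35", "QL/033.2401");
        System.out.println(unknownPrefixes(references, known));
    }
}
```

The same comparison works for the language codes: collect the codes used in the references and subtract the declared ones.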

Paul Strack Jan 12, 2017 (02:06)

+Severin Zahler Thanks for the feedback. I haven't worked on Eldamo for ... an embarrassingly long time. I've been obsessed with other projects in 2016, but hope to get back to working on Eldamo soon.

I haven't done validations on the parts of speech for a while, so the XSD specifications and the XML data may be out of sync. I will clean it up when I have time. I added a GitHub issue so I won't forget:

github.com - Fix various data validation issues · Issue #7 · pfstrack/eldamo

Regarding the ref language codes that do not appear in the languages list, those are references that are marked as one language that I put under words classified as a different language. There are various reasons why I did this.

"dor" and "fal" are dialects of Ilkorin (ilk), namely Doriathrin and Falathrim. "bel" is a late variation on Ilkorin that Tolkien labeled Beleriandric. "oss" and "edan" are variants of Danian (dan), namely Ossriandric and East Danian.

"sol" is Solosimpi which I lumped in with Early Telerin (et). "eon" is Early Old Noldorin which I lumped in with Early Noldorin though I may separate it in the future when I get around to analyzing it. "ln" is Late Noldorin, for words labeled "Noldorin" by Tolkien in the transitional period between Noldorin and Sindarin.

Severin Zahler Jan 12, 2017 (08:06)

Thanks for the reply! Will add those codes and the language names to my conlang table then; will think about an easy way to make these relations between the languages easily queryable.
The code "lon", is that "Late Old Noldorin"?

Paul Strack Jan 12, 2017 (15:43)

Yes, "lon" is Late Old Noldorin

Severin Zahler Jan 13, 2017 (17:16)

+Paul Strack I have compiled a list of all reference sources which have 5 or more digits in the line + word position part of the source. Are all of these, only some of them, or none of them erroneous? If they are correct, what do the digits represent if there are five of them?

pastebin.com - SD/421.30061 SD/421.30063 SD/422.08101 SD/421.30065 SD/422.08071 SD/422.080 - Pastebin.com

Asking as I am storing all bits of information, i.e. page, line number and word position, in a separated form.

Also there are various sources with a format like e.g. PE13/155.9901-1; what does the "-1" mean?

Paul Strack Jan 14, 2017 (02:54)

The information after the page number is not entirely consistent. Mostly it means line number and word position, but I break that rule often. For example, Tolkien may designate a specific word as belonging to several languages, and I need multiple references from one word. Sometimes I split the elements of a compound word into their own references. Sometimes I just add an extra digit and sometimes I use the -1 format. Over the years I've changed my mind several times.

Beyond the page number, I think the most I can guarantee is that the identifier should be unique.

Severin Zahler Jan 14, 2017 (11:07)

Ah, now that you say it and I've sorted the list, it is actually fairly obvious!
As I read it now, the first four digits are line number + word position as usual and the last digit is the variant, often just "5" or "1", and possibly higher digits if more variants exist.

A few single ones that still puzzle me though:
PE13/147.30310: This is the only one with digit "0" at the end.
QL/060.70ive: has letters instead of digits.
TI/310.0034091: has a lot of digits.
PE19/093.20 13: has a space in between.
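
The reading described above can be written down as a regular expression, which also flags identifiers that do not fit. This is only a sketch under assumptions: the split of the four digits into a two-digit line number and a two-digit word position is my own guess at the convention, and the class and field names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the 2 + 2 digit split is an assumption, not a documented rule.
class ReferenceParser {

    // prefix "/" page "." line(2 digits) word(2 digits) [variant digit] ["-" number]
    private static final Pattern REF =
            Pattern.compile("([A-Za-z0-9]+)/(\\d+)\\.(\\d{2})(\\d{2})(\\d)?(?:-(\\d+))?");

    static final class Parsed {
        final String prefix;
        final int page;     // stored as a number, so leading zeros are dropped
        final int line;
        final int word;
        final String variant; // "" if absent, otherwise e.g. "1" or "-1"

        Parsed(String prefix, int page, int line, int word, String variant) {
            this.prefix = prefix;
            this.page = page;
            this.line = line;
            this.word = word;
            this.variant = variant;
        }
    }

    // Returns the parsed parts, or an empty Optional if the identifier does not fit the pattern.
    static Optional<Parsed> parse(String reference) {
        Matcher m = REF.matcher(reference);
        if (!m.matches()) {
            return Optional.empty();
        }
        String digit = m.group(5) == null ? "" : m.group(5);
        String suffix = m.group(6) == null ? "" : "-" + m.group(6);
        return Optional.of(new Parsed(m.group(1), Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)), digit + suffix));
    }

    public static void main(String[] args) {
        List<String> samples = List.of("SD/421.30061", "PE13/155.9901-1",
                "QL/060.70ive", "TI/310.0034091", "PE19/093.20 13");
        for (String sample : samples) {
            System.out.println(sample + " -> " + (parse(sample).isPresent() ? "ok" : "does not fit"));
        }
    }
}
```

Note that PE13/147.30310 still fits this pattern (with variant "0"), so a check on unusual variant digits would have to be added separately.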

But forgive me for labelling so many things in the data as probably erroneous, I absolutely don't want to put you to shame, not at all! I just want to make sure I don't misinterpret this valuable data, and I also hope to help you discover the very few inconsistencies in this massive amount of data. Actually, considering that you seemingly arranged all of this manually, it is incredibly consistent!

Paul Strack Jan 14, 2017 (17:43)

No need to apologize. I find it very interesting when people use data in unexpected ways, and this can reveal flaws in the raw data. The four examples were all typos, so I've fixed them in the source data and the fix will be in the next release.

Severin Zahler Jan 16, 2017 (11:29)

+Paul Strack Thanks a lot for the update! In particular, settling on uniform names for the speech attribute makes programmatic analysis quite a bit easier :)

Now that I've extended the code to specifically check the length of the line number + word position part, I stumbled over a handful of sources which have only 3 digits. As there are only very few of those, I assume they are not intentional:

- RGEO/63.030 ("A Elbereth Gilthoniel")
- LotR/0429.001 ("Methedras")
- LR/317.002 ("iChúrinien")
- S/154.001 ("Grond")
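
A length check of this kind can be sketched as grouping the references by the number of digits that follow the dot. This is an illustrative sketch only, with hypothetical names, and not the code actually used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; not part of Eldamo or of the program discussed in this thread.
class PositionLengthCheck {

    // Counts the digits directly after the first '.'; returns -1 if there is no '.'.
    static int positionDigits(String reference) {
        int dot = reference.indexOf('.');
        if (dot < 0) {
            return -1;
        }
        int i = dot + 1;
        while (i < reference.length() && Character.isDigit(reference.charAt(i))) {
            i++;
        }
        return i - dot - 1;
    }

    // Groups references by digit count, so that rare lengths stand out.
    static Map<Integer, List<String>> byLength(List<String> references) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String reference : references) {
            groups.computeIfAbsent(positionDigits(reference), k -> new ArrayList<>())
                    .add(reference);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> sample =
                List.of("RGEO/63.030", "QL/033.2401", "SD/421.30061", "S/154.001");
        byLength(sample).forEach(
                (digits, refs) -> System.out.println(digits + " digits: " + refs));
    }
}
```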

Also I think you've overlooked my remark on a few source reference prefixes ("books", to avoid confusion) which are not listed; I'm not familiar with all abbreviations, so I can't wholly judge which are just missing from the sources list and which may be typos, here they are again (all appear uniquely):

- VT32/07.1106
- LT/192.3006
- LT/132.0212
- PEE/17.35

Paul Strack Jan 16, 2017 (18:17)

VT32 and PEE are both missing from the sources, so I've added them. The rest of the above are typos.

It takes a long time to do a new full build of Eldamo, but I'm finding this process very fruitful, so I've checked a temporary data file with the modifications into GitHub here:

github.com - eldamo

[EDIT] When I clicked on the download button in my browser, it tried to load the xml file into my browser window, so you may want to right-click on the button and choose "Save As..."