G+ LoME Archive
Jan 10, 2017 (10:37)
So, after the Christmas break I am back working on my conlang database project. Currently I am developing a Java program which dissects the data from Eldamo, sorts and arranges it into tables, and ultimately will load it directly into the database. As an intermediate development / debugging step I am writing the tables to CSV files for easy verification of the code.
As I am doing this I am stumbling across some of the (so far) very few errors in the Eldamo file. I just extracted the "speech" attribute of all word pages (in order to create a word type table), and there are a few things I found odd / noteworthy:
- There are a few combined types which appear in two different orders; I presume these are not intentionally different from each other:
-- "n adv" vs. "adv n"
-- "n adj" vs. "adj n"
-- "conj adv" vs. "adv conj"
-- "adv prep" vs. "prep adv"
-- "adv interj" vs. "interj adv"
- "card" appears once; in all other occurrences you put "cardinal":
- Similar for "suf" vs. "suffix"; here the latter appears only once:
- Similar for "pref" vs. "prefix", prefix is used 4 times in total.
- There's one occurrence of "v" instead of "vb":
- For G. siriol "flowing" some major mix-up of the fields has happened; "sindi" appears as the word and "siriol" as the word type:
- There are two words which are missing the speech attribute, even though the XSD file states there should always be one:
I hope this helps to fix some things :) For your convenience, here's the entire output I got from my program:
I will also try to add a few additional checks to my code which may help to find a few more of these minor flaws :)
Eldamo : Early Quenya : atwen
ᴱQ. atwen card. “20”. Reference ✧ QL/033.2401 ✧ “20”. Derivations. < ᴱ√ATA¹ “dual” ✧ QL/033.2201.
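A check for the order variants listed above could look roughly like the following. This is only a minimal sketch: the class name `SpeechCheck` and its methods are made up for illustration and are not part of Eldamo or of the program described in this thread. It assumes the "speech" values have already been extracted as plain strings, sorts the space-separated parts of each value, and reports every group that occurs in more than one spelling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not taken from the actual project.
final class SpeechCheck {

    // Sorts the space-separated parts so that "n adv" and "adv n" share one key.
    static String canonical(String speech) {
        String[] parts = speech.trim().split("\\s+");
        Arrays.sort(parts);
        return String.join(" ", parts);
    }

    // Groups the distinct spellings by canonical key and keeps only the
    // groups that contain more than one spelling.
    static Map<String, TreeSet<String>> orderVariants(List<String> speechValues) {
        Map<String, TreeSet<String>> groups = new TreeMap<>();
        for (String s : speechValues) {
            groups.computeIfAbsent(canonical(s), k -> new TreeSet<>()).add(s.trim());
        }
        groups.values().removeIf(spellings -> spellings.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        // Sample values only; the real input would be the extracted attributes.
        List<String> sample = List.of("n adv", "adv n", "adj", "prep adv", "adv prep", "vb");
        System.out.println(orderVariants(sample));
    }
}
```

Abbreviation mismatches such as "suf" vs. "suffix" are not caught by this; those would need a separate lookup of the allowed values.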
Jan 11, 2017 (14:09)
There are multiple language codes used in the ref tags which are not defined in the languages list. (All language codes used in the [...] tags exist though.)
Jan 11, 2017 (16:25)
Next up, there are a few sources with prefixes that are not mentioned in the sources list:
- LT/192.3006, a duplicate of another reference which differs only in having a valid source (LT1/192.3006)
- LT/132.0212, this one has no valid duplicate reference though.
- VT32/07.1106: Unique case, don't have VT on me to check whether the source is erroneous or whether VT32 is just missing from the sources list.
- PEE/17.35, unique case
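The prefix check behind the list above can be sketched as follows. The names `SourcePrefixCheck`, `prefixOf` and `unknown` are hypothetical, and the set of known prefixes in the usage example is a stand-in; in practice it would be filled from the sources list in the data file. The sketch assumes that the prefix is everything before the first "/".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not taken from the actual project.
final class SourcePrefixCheck {

    // Assumption: the prefix ("book") is the part before the first slash.
    static String prefixOf(String source) {
        int slash = source.indexOf('/');
        return slash < 0 ? source : source.substring(0, slash);
    }

    // Returns every source whose prefix is not in the set of known prefixes.
    static List<String> unknown(List<String> sources, Set<String> knownPrefixes) {
        List<String> result = new ArrayList<>();
        for (String source : sources) {
            if (!knownPrefixes.contains(prefixOf(source))) {
                result.add(source);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the real sources list.
        Set<String> known = Set.of("LT1", "QL");
        List<String> sources = List.of("LT1/192.3006", "LT/192.3006", "PEE/17.35");
        System.out.println(unknown(sources, known));
    }
}
```

A check like this cannot tell a typo from a prefix that is simply missing from the list; that still needs a look at the books.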
Jan 12, 2017 (02:06)
Thanks for the feedback. I haven't worked on Eldamo for ... an embarrassingly long time. I've been obsessed with other projects in 2016, but hope to get back to working on Eldamo soon.
I haven't done validations on the parts of speech for a while, so the XSD specifications and the XML data may be out of sync. I will clean it when I have time. I added a github issue so I won't forget:
github.com - Fix various data validation issues · Issue #7 · pfstrack/eldamo
Regarding the ref language codes that do not appear in the languages list, those are references that are marked as one language that I put under words classified as a different language. There are various reasons why I did this.
"dor" and "fal" are dialects of Ilkorin (ilk), namely Doriathrin and Falathrim. "bel" is a late variation on Ilkorin that Tolkien labeled Beleriandric. "oss" and "edan" are variants of Danian (dan), namely Ossriandric and East Danian.
"sol" is Solosimpi which I lumped in with Early Telerin (et). "eon" is Early Old Noldorin which I lumped in with Early Noldorin though I may separate it in the future when I get around to analyzing it. "ln" is Late Noldorin, for words labeled "Noldorin" by Tolkien in the transitional period between Noldorin and Sindarin.
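One simple way to keep these relations queryable is a lookup from the variant code to the code it is filed under. The sketch below is only an illustration with made-up names (`LanguageParents`, `resolve`); it contains only the pairs for which both codes are given above. "eon" and "ln" are left out because the codes of the languages they belong to are not stated here. In a database the same idea would be a parent column on the language table.

```java
import java.util.Map;

// Hypothetical helper, not part of Eldamo.
final class LanguageParents {

    // Variant code -> code of the language it is grouped under,
    // restricted to the pairs named in the explanation above.
    static final Map<String, String> PARENT = Map.of(
            "dor", "ilk",   // Doriathrin, dialect of Ilkorin
            "fal", "ilk",   // Falathrim, dialect of Ilkorin
            "bel", "ilk",   // Beleriandric, late variation on Ilkorin
            "oss", "dan",   // Ossriandric, variant of Danian
            "edan", "dan",  // East Danian, variant of Danian
            "sol", "et");   // Solosimpi, lumped in with Early Telerin

    // Returns the grouping language, or the code itself if it has no known parent.
    static String resolve(String code) {
        return PARENT.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println(resolve("dor"));
        System.out.println(resolve("ilk"));
    }
}
```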
Jan 12, 2017 (08:06)
Thanks for the reply! Will add those codes and the language names to my conlang table then; will think about an easy way to make these relations between the languages easily queryable.
The code "lon", is that "Late Old Noldorin"?
Jan 12, 2017 (15:43)
Yes, "lon" is Late Old Noldorin
Jan 13, 2017 (17:16)
I have compiled a list of all reference sources which have 5 or more digits in the line + word position part of the source. Are all of these, only some of them, or none of them erroneous? If they are correct, what do the digits represent if there are five of them?
pastebin.com - SD/421.30061 SD/421.30063 SD/422.08101 SD/421.30065 SD/422.08071 SD/422.080 - Pastebin.com
Asking as I am storing all bits of information, i.e. page, line number and word position, separately.
Also there are various sources with a format like PE13/155.9901-1; what does the "-1" mean?
Jan 14, 2017 (02:54)
The information after the page number is not entirely consistent. Mostly it means line number and word position, but I break that rule often. For example, Tolkien may designate a specific word as belonging to several languages, and I need multiple references from one word. Sometimes I split the elements of a compound word into their own references. Sometimes I just add an extra digit and sometimes I use the -1 format. Over the years I've changed my mind several times.
Beyond the page number, I think the most I can guarantee is that the identifier should be unique.
Jan 14, 2017 (11:07)
Ah, now that you say it and I sort the list, it is actually fairly obvious!
As I read it now, the first four digits are line number + word position as usual and the last digit is the variant, often just "5" or "1", and possibly higher digits if more variants exist.
A few single ones that still puzzle me though:
PE13/147.30310: This is the only one with digit "0" at the end.
QL/060.70ive: has letters instead of digits.
TI/310.0034091: has a lot of digits.
PE19/093.20 13: has a space in between.
But forgive me for labelling so many things in the data as probably erroneous; I absolutely don't want to put you to shame, not at all! I just want to make sure I don't misinterpret this valuable data, and hopefully also help you discover the very few inconsistencies in this massive amount of data. Actually, considering that you seemingly arranged all of this manually, it is incredibly consistent!
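The reading described above (four digits for line number + word position, then an optional variant) can be written down as a small parser. This is a sketch under assumptions, not a definition of the Eldamo format: the class `RefParser` is made up, the 2 + 2 split of the four digits is assumed, and the variant is taken to be either one extra digit or a "-N" suffix. Anything that does not fit comes back empty and can be listed for manual review.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, based only on the reading discussed in this thread.
final class RefParser {

    static final class Ref {
        final String prefix;
        final int page;
        final int line;
        final int word;
        final String variant; // null if absent, else e.g. "1" or "-1"

        Ref(String prefix, int page, int line, int word, String variant) {
            this.prefix = prefix;
            this.page = page;
            this.line = line;
            this.word = word;
            this.variant = variant;
        }
    }

    // prefix "/" page "." two line digits, two word digits, optional variant.
    private static final Pattern FORMAT =
            Pattern.compile("([^/]+)/(\\d+)\\.(\\d{2})(\\d{2})(\\d|-\\d+)?");

    // Returns the parsed parts, or empty if the source does not fit the pattern.
    static Optional<Ref> parse(String source) {
        Matcher m = FORMAT.matcher(source);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Ref(
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)),
                m.group(5)));
    }

    public static void main(String[] args) {
        String[] samples = {
            "SD/421.30061", "PE13/155.9901-1", "QL/060.70ive",
            "TI/310.0034091", "PE19/093.20 13"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (parse(s).isPresent() ? "ok" : "needs review"));
        }
    }
}
```

Note that PE13/147.30310 fits this pattern even though its final "0" is unusual, so a pattern check alone does not single it out.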
Jan 14, 2017 (17:43)
No need to apologize. I find it very interesting when people use data in unexpected ways, and this can reveal flaws in the raw data. The four examples were all typos, so I've fixed them in the source data and the fix will be in the next release.
Jan 16, 2017 (11:29)
Thanks a lot for the update! Especially that you settled on uniform naming for the speech attribute makes a programmatic analysis quite a bit easier :)
Now that I've extended the code to look specifically at the length of the line number + word position part, I stumbled over a handful of sources which have only 3 digits. As there are only very few of those I assume these are not intentional:
- RGEO/63.030 ("A Elbereth Gilthoniel")
- LotR/0429.001 ("Methedras")
- LR/317.002 ("iChúrinien")
- S/154.001 ("Grond")
Also I think you've overlooked my remark on a few source reference prefixes ("books", to avoid confusion) which are not listed. I'm not familiar with all abbreviations, so I can't fully judge which are just missing from the sources list and which may be typos. Here they are again (each appears only once):
Jan 16, 2017 (18:17)
VT32 and PEE are both missing from the sources, so I've added them. The rest of the above are typos.
It takes a long time for a new full build of Eldamo, but I'm finding this process very fruitful, so I've checked a temporary data file with the modifications into Github here:
github.com - eldamo
[EDIT] When I clicked on the download button in my browser, it tried to load the xml file into my browser window, so you may want to right-click on the button and choose "Save As..."