Leonard W. Jul 24, 2015 (22:36)

I do love Eldamo as much as the rest of you guys, so I've been working hard to import its XML data source into Parf Edhellen. Today, I (more or less!) successfully imported all the relevant definitions, with only some minor loss of etymological data. I will correct as much as possible in the days to come and enhance search, so you can choose which sources you'd like to consult. I also hope to be able to produce direct links to Eldamo for more in-depth information, once Paul gets back to me.

So! I hope you guys find it useful! Oh, and let me know if there's anything else you'd like me to improve. :)

Leonard W. Jul 25, 2015 (00:52)

I've decided to redo the import, because the reconstruction markers fell away. It's running right now. Done!

Paul Strack Jul 25, 2015 (02:02)

Very nice! I am glad to hear you found the data useful. I saw your chat request but it was at 4 AM my time (California) so I was asleep.

Did you use the raw XML or the generated HTML? The reason I am asking is that I am getting ready to release a cleaned up version of the raw data and you might be able to use it for a better import.

As for linking, I may have to modify the data model for that. Right now the page ids are generated by an obscure algorithm and not exposed in the data. I will give it some thought.

Leonard W. Jul 25, 2015 (12:56)

Thank you for getting back to me so soon! :) I used the raw XML file.

I would naturally appreciate a cleaner version, but I have to say that it was fairly consistent! The biggest hurdles were:
* nodes, simply because of the lack of documentation.
* cognate words without glosses. As far as I can see, these words retrieve their glosses by referencing cognate word nodes, but it's difficult to know which node to read from! I rather imperfectly use the XPath expression //word[@v="..."][@l] for these words.
* I also haven't parsed sub-level nodes (like naru and garth), so some loss of definitions definitely occurred.
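For what it's worth, the cognate lookup described above can be sketched like this with Python's standard library. The element and attribute names (@v, @l, gloss, cognate) follow this thread, but the sample fragment itself is invented, so treat it as an illustration of the XPath rather than of Eldamo's actual schema:

```python
import xml.etree.ElementTree as ET

# Invented fragment mimicking the structure discussed in the thread:
# words carry @v (form) and @l (language); a word without a gloss may
# reference a cognate that has one.
SAMPLE = """
<words>
  <word v="galadh" l="s" gloss="tree" speech="n"/>
  <word v="alda" l="q">
    <cognate v="galadh" l="s"/>
  </word>
</words>
"""

def resolve_gloss(root, word_el):
    """Return the word's own gloss, falling back to its cognate's gloss."""
    if word_el.get("gloss"):
        return word_el.get("gloss")
    cog = word_el.find("cognate")
    if cog is not None:
        v = cog.get("v")
        # The XPath from the thread: //word[@v="..."][@l]
        match = root.find(f".//word[@v='{v}'][@l]")
        if match is not None:
            return match.get("gloss")
    return None

root = ET.fromstring(SAMPLE)
alda = root.find(".//word[@v='alda']")
print(resolve_gloss(root, alda))  # -> tree
```

Note that ElementTree only supports a subset of XPath, but attribute-value and attribute-existence predicates like the one above are part of it.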

I've realised that I'll need to continuously update these definitions in the coming weeks, so I've created a compound ID myself, based on the SHA1 hash of the language, word, gloss and grammar type ("speech", as you call it). These IDs don't help me link to you, but they help me identify the definitions until a more robust algorithm can be found.
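A minimal sketch of such a compound ID, hashing the four fields mentioned above; the field order and the separator (which guards against ambiguous concatenations) are my own guesses, not necessarily what Parf Edhellen does:

```python
import hashlib

def compound_id(language, word, gloss, speech):
    """Derive a stable identifier for a definition from the fields
    that distinguish it: language, word form, gloss and part of speech.
    A separator byte prevents two different field lists from
    concatenating to the same string."""
    key = "\x1f".join([language, word, gloss, speech])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

# Same input always yields the same 40-character hex ID.
print(compound_id("s", "galadh", "tree", "n"))
```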

/ Leonard

Paul Strack Jul 25, 2015 (15:21)

You guessed correctly on your XPath. Words are uniquely identified by their @v + @l attributes, so looking up the cognate link that way will locate the correct word.

In the 4.7 version of the lexicon, a lot of words are missing glosses, since I had not copied them from the refs in all cases. I fixed that in 4.8. I should release it sometime today or tomorrow.

I'd be happy to discuss the data structure. The 4.8 release also has some basic documentation (an XSD), but it is mostly auto-generated and some things are likely to be unclear.

Paul Strack Jul 25, 2015 (17:37)

The 0.4.8 version of Eldamo is up. I also added a @page-id attribute to the word element that has the numeric id of a word's web page. You can use that for creating cross-links to{page-id}.html
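Reading the new @page-id attribute and building a link could look roughly like this. The base URL below is a placeholder, since the full link target isn't preserved in this thread; only the "{page-id}.html" suffix comes from Paul's description:

```python
import xml.etree.ElementTree as ET

# Placeholder base URL -- substitute wherever Eldamo serves its word pages.
BASE_URL = "https://example.org/words/"

# Invented word element carrying the @page-id attribute Paul describes.
SAMPLE = '<word v="alda" l="q" page-id="123"/>'

word = ET.fromstring(SAMPLE)
page_id = word.get("page-id")
if page_id is not None:
    # Cross-link of the form {page-id}.html appended to the base URL.
    print(BASE_URL + page_id + ".html")
```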

Regarding the data: if you want to extract only dictionary words, you might want to filter on parts of speech (the @speech attribute), dropping "grammar" entries and anything where @speech starts with "phone" (phonetic entries). You may also want to filter out "text" and "phrase" entries from the dictionary.
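The filters suggested above amount to a small predicate on the @speech attribute; a sketch, assuming those are the only categories to exclude:

```python
def is_dictionary_word(speech):
    """Keep only dictionary-worthy entries, per the filters suggested
    in this thread: drop grammar entries, anything phonetic (speech
    starting with "phone"), and whole texts or phrases."""
    if speech is None:
        return False
    if speech == "grammar":
        return False
    if speech.startswith("phone"):
        return False
    if speech in ("text", "phrase"):
        return False
    return True

# Example: only the noun and verb entries survive the filter.
print([s for s in ["n", "grammar", "phoneme", "phrase", "vb"]
       if is_dictionary_word(s)])  # -> ['n', 'vb']
```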

G. Hussain Chinoy Jul 25, 2015 (17:58)

+Leonard W. Great idea to create a compound key for referencing elements! I used something similar while importing Eldamo for further analysis (L, V, Speech). Great job on, too!

Leonard W. Jul 26, 2015 (17:24)

With the release of 0.4.8, I've been able to link to Eldamo. Look for the globe icon at the right corner of the definition.