Mellyn!
I do love Eldamo as much as the rest of you guys, so I've been working hard to try to import its XML data source into Parf Edhellen. Today, I've (more or less!) successfully imported all the relevant definitions, with only some minor loss of etymological data. I will make sure to correct as much as possible in the days to come, and enhance search, so you can choose which sources you'd like to consult. I also hope to be able to produce direct links to Eldamo for more in-depth information, once Paul gets back to me.
So! I hope you guys find it useful! Oh, and let me know if there's anything else you'd like me to improve. :) www.elfdict.com
Leonard W. Jul 25, 2015 (00:52)
I've decided to redo the import, because the reconstruction markers fell away. It's running right now.
Done!
Paul Strack Jul 25, 2015 (02:02)
Did you use the raw XML or the generated HTML? The reason I am asking is that I am getting ready to release a cleaned-up version of the raw data and you might be able to use it for a better import.
As for linking, I may have to modify the data model for that. Right now the page ids are generated by an obscure algorithm and not exposed in the data. I will give it some thought.
Leonard W. Jul 25, 2015 (12:56)
I would naturally appreciate a cleaner version, but I have to say that it was fairly consistent! The biggest hurdles were:
* nodes, simply because of lacking documentation.
* cognate words without glosses. These words retrieve their glosses, as far as I can see, by referencing cognate word nodes, but it's difficult to know which node to read from! I rather imperfectly use the XPath expression //word[@v="..."][@l] for these words.
* I also haven't parsed sub-level
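For readers following along, the XPath lookup mentioned in the list above can be sketched in Python with the standard library. This is only an illustration: the sample XML, the `gloss` attribute, the root element name and the helper `find_glossed_candidates` are assumptions made for the example, not the actual Eldamo schema or the import code used for Parf Edhellen.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real Eldamo file is far richer.
SAMPLE = """
<word-data>
  <word v="galad" l="s" speech="n" gloss="light"/>
  <word v="galad"/>
  <word v="calad" l="s" speech="n" gloss="light"/>
</word-data>
"""

def find_glossed_candidates(root, form):
    """Return <word> nodes whose @v equals `form` and that carry an @l attribute.

    Mirrors the expression //word[@v="..."][@l] using ElementTree's XPath subset.
    A form containing a single quote would need extra handling.
    """
    return root.findall(f".//word[@v='{form}'][@l]")

root = ET.fromstring(SAMPLE)
matches = find_glossed_candidates(root, "galad")
for node in matches:
    print(node.get("l"), node.get("v"), node.get("gloss"))
```

As the post says, this match is imperfect: if several nodes share the same form and each has a language attribute, the expression cannot tell which one the cognate refers to.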
I've realised that I'll need to continuously update these definitions in the coming weeks, so I've created a compound ID myself, based on the SHA1 hash of the language, word, gloss and grammar type ("speech" as you call it). These IDs don't help me link to you, but they help me identify the definitions until such time as a more robust algorithm can be identified.
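A compound ID of the kind described above might look like the following Python sketch. The function name `compound_id`, the field order and the separator are assumptions for illustration; the post does not say how the four fields are actually combined before hashing.

```python
import hashlib

def compound_id(language, word, gloss, speech):
    """Build a stable identifier from the four fields named in the post.

    The unit separator (0x1f) between fields is an assumption; it keeps
    ("ab", "c") and ("a", "bc") from producing the same input string.
    """
    key = "\x1f".join([language, word, gloss, speech])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

# Hypothetical entry values, for illustration only.
example_id = compound_id("s", "galad", "light", "n")
print(example_id)
```

The ID stays the same across re-imports as long as none of the four fields change, which is also its weakness: correcting a gloss yields a new ID.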
Cheers!
/ Leonard
Paul Strack Jul 25, 2015 (15:21)
In the 4.7 version of the lexicon, a lot of words are missing glosses, since I had not copied them from the refs in all cases. I fixed that in 4.8. I should release it sometime today or tomorrow.
I'd be happy to discuss the data structure. The 4.8 release also has some basic documentation (an XSD), but it is mostly auto-generated and some things are likely to be unclear.
Paul Strack Jul 25, 2015 (17:37)
http://eldamo.org/content/words/word-{page-id}.html
Regarding the data, if you want to extract only dictionary words, you might want to filter out certain parts of speech (the @speech attribute): "grammar" entries, anything where @speech starts with "phone" (phonetic entries). You may also want to filter out "text" and "phrase" entries from the dictionary.
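The filtering suggested above could be sketched like this in Python. The sample XML, the root element name and the specific `speech` values other than "grammar", "text", "phrase" and the "phone" prefix are made up for the example; only the filtering rule itself comes from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute values such as "phonetics" are assumptions.
SAMPLE = """
<word-data>
  <word v="galad" l="s" speech="n"/>
  <word v="plural nouns" l="s" speech="grammar"/>
  <word v="a-affection" l="s" speech="phonetics"/>
  <word v="mae govannen" l="s" speech="phrase"/>
  <word v="some poem" l="s" speech="text"/>
</word-data>
"""

# Entry types to drop, as listed in the post.
EXCLUDED_SPEECH = {"grammar", "text", "phrase"}

def is_dictionary_word(elem):
    """True unless @speech marks a grammar, phonetic, text or phrase entry."""
    speech = elem.get("speech", "")
    return speech not in EXCLUDED_SPEECH and not speech.startswith("phone")

root = ET.fromstring(SAMPLE)
kept = [w.get("v") for w in root.iter("word") if is_dictionary_word(w)]
print(kept)
```

Whether "text" and "phrase" entries should be dropped depends on the dictionary; removing them from `EXCLUDED_SPEECH` keeps them in the import.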
G. Hussain Chinoy Jul 25, 2015 (17:58)
Leonard W. Jul 26, 2015 (17:24)