Lúthien Merilin Aug 12, 2015 (11:38)

Elvish dictionary app(lication)

Mellyn,

Following a session at the Omentielva Enquea about "Identifying the Requirements for a Quenya & Sindarin Dictionary App", Roman Rausch and I figured that this might be the best place to follow up on it (this being by far the most active community at this time).

The reason that I brought it up in the first place was to find out if there was any need for 'another dictionary app' (actually, there seem to be some new developments that I haven't checked out yet, but will asap) and, if so, what features people would consider useful to have. 
One major issue that Roman and I wanted to address is the fact that the content of existing applications is essentially outdated as soon as new material or new insights are published (e.g. PE22). A printed word list or a standalone application can only be updated by re-creating it; and even online resources (like an online word list) often depend on the availability and energy of the one person who maintains them.

Some time ago we discussed the possibility of maintaining a central repository containing the words and translations, which could then be pushed or pulled to client applications in order to update their contents. The problem with this construction is that it requires one or more persons responsible for maintaining this repository (both linguistically and technically), which might be problematic.

But the Wikipedia model of allowing (registered) users to edit the content (and trusting the community to correct possible errors and vandalism) could possibly work in this case: a user could then add or edit entries using the application, and the changes would be pushed to all other users. This would get around the need for a centrally maintained repository and facilitate cooperation among multiple users ("you enter entries #1-30, I'll do the rest").
I think that this could work because the people who would use it aren't as anonymous as the ones using Wikipedia - and even there it works. 

The discussion at the Omentielva last week resulted in quite a number of 'user requirements' and points to consider: 

- ways to indicate various layers of reliability of the vocabulary: kept, deleted, replaced ...
- provide etymological information: s vs. th in Quenya, special case mutations for Sindarin
- a switch to exclude reconstructions
- names should be included
- priority #1: a working interface, better to have a small running application and extend it later
- buttons for user-friendly "easy-regex" searches: end of a word, beginning of a word, ... (see the sketch after this list)
- full regex functionality for the advanced users
- generate new vocabulary by applying rules to the existing list (?)
- check out leo.de ("like google translate, but better than google translate!"): has derived words, examples of use
- general problem: accommodate both academic & non-academic users
- provide Tolkien's roots in the entries
- indicate which other words were discussed by Tolkien in the same gloss
- treat different attestations as different items as much as possible, only join them when the gloss is exactly the same
- mark deduced words: S. Forochel -> #hêl
- functionality: users can provide alternatives to English words if a particular word isn't found (e.g. 'translucid', but not 'transparent'); these will then pop up as suggestions for others
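
By way of illustration of the "easy-regex" item above, a minimal sketch in Python (the mode names and the sample word list are made up for the example):

import re

# Map user-friendly "easy-regex" buttons to real regular expressions.
# Mode names and sample words are illustrative only.
def easy_search(words, term, mode="contains"):
    if mode == "starts":          # beginning of a word
        pattern = "^" + re.escape(term)
    elif mode == "ends":          # end of a word
        pattern = re.escape(term) + "$"
    elif mode == "regex":         # full regex for advanced users
        pattern = term
    else:                         # plain substring search
        pattern = re.escape(term)
    rx = re.compile(pattern)
    return [w for w in words if rx.search(w)]

words = ["têw", "tîw", "tengwa", "toltha-"]
print(easy_search(words, "t", mode="starts"))  # all four words
print(easy_search(words, "w", mode="ends"))    # ['têw', 'tîw']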

Providing everything that was suggested would of course require a lot of data input, but it's perfectly OK to have that grow gradually.

The first thing to tackle now would be to sort out some technical issues (such as which platform to use, the data model etcetera) and produce a proof-of-concept version for people to test.

I suppose that it would be helpful to create something like a Google doc to keep track of lists of requirements etc.

Questions, comments, suggestions are more than welcome!

Francesco Veneziano Aug 12, 2015 (12:31)

Hello Lúthien,
http://eldamo.org/
by Paul Strack
is the lexicon that I mentioned.
It looks like an excellent resource, and you should check out the schema documentation, which already seems to include many of the requested or desirable features we discussed.
I remember that you mentioned performance issues as a downside of XML (or maybe I misunderstood), but Eldamo seems to be very fast.

Tamas Ferencz Aug 12, 2015 (13:09)

Well, I started EldarinWiki some years ago (members of Aglardh may remember it); it was a wiki, based on MediaWiki, the platform Wikipedia runs on; it was collaborative, and its goal was to collect all attested Eldarin words and morphemes in wiki format, with cross-references and semantic categorization and all. It died a slow and terrible death, as there was very little interest in the community in contributing, and I alone did not have the strength to work on it day by day. It is no longer online.
Thank Eru and +Paul Strack that we have Eldamo now, which does all those things, only better. I think it works just fine in a browser, but perhaps one day someone will write frontend apps for it, for Android or GNOME or OS X or whatever they fancy.

David Giraudeau Aug 12, 2015 (13:26)

I have been working for a few years now on a "New Q(u)enya Lexicon" project. In order to retain all the information from each entry, and in addition to Lúthien's list above, I think it could be useful to give information about:

- any standardization applied to the entry (c vs. k for instance),
- the full etymology (not only the root)
- notes on the context or on external information (e.g. editorial notes, interesting comments from a post in a mailing list, etc.)
- quotes related to the entry (e.g. a sentence from Galadriel's Lament)
- internal (i.e. Arda's) dating
- external dating

Lúthien Merilin Aug 12, 2015 (13:42)

Thanks!

Tamas Ferencz Aug 12, 2015 (14:04)

+David Giraudeau
Come to think of it, VinQuettaParma has a similar goal, albeit on a much smaller scale and only for neologisms.

Leonard W. Aug 12, 2015 (19:43)

+Lúthien Merilin I don't know if you've seen elfdict.com? Parf Edhellen (www.elfdict.com) is meant to be a community-driven online dictionary app for Tolkien's languages. It is currently moderated by me, and contains many of the features you describe above.

I've been working on the project for several years, and its data model can certainly be improved, but I'd be delighted if you'd consider the website for this purpose. I'm naturally available should you have any questions.

I'm not interested in making money from Parf Edhellen, and I've published the full source code on GitHub: https://github.com/galadhremmin/Parf-Edhellen

Paul Strack Aug 12, 2015 (21:46)

+Lúthien Merilin As you can see from this thread, there are a number of existing projects that have goals similar to yours.

One other thing to consider that I haven't seen suggested yet is tracing the conceptual development of the various forms of words. I think this is especially important for academic purposes.

Also, the biggest challenge by far for such a project is the data entry or collection. You can probably put together an effective data model and a data entry system in a couple of months. The data entry will require years, possibly decades of effort. Whatever you do, you want to optimize around ease of data entry, and if possible leverage existing data sets.

In my experience, most people (myself included) are happy to share data, and you could get a big jump start on the project if you don't have to start from scratch.

Leonard W. Aug 12, 2015 (23:15)

+Paul Strack Yes, Eldamo.org is in my opinion a great platform to build on. Your open approach enables it to be an excellent data source for projects/apps such as ElfDict.

I just want to put that out there, that I'd also vouch for Eldamo, for what it's worth!

Roman Rausch Aug 14, 2015 (10:52)

Thanks to everyone for their comments.
To clarify what we want to achieve: there seems to be a curse hanging over Elvish dictionaries, in that if they're maintained by one person, once that person loses interest or no longer has time for maintenance, the rest of the community is simply stuck with the last state of the dictionary. We saw that happen with Hiswelóke, which is discontinued, and with Quettaparma, which doesn't seem to get any updates anymore.
The idea is to break this curse by enabling the users to enter and correct words in a wiki-like fashion. The problem with EldarinWiki was, I think, that it was too much of a blank slate - as Paul said, entering the data from scratch may take close to decades. We have, in any case, an up-to-date Sindarin list which I maintain at sindarin.de. For Quenya we'd need some help - if it is possible to import Eldamo's data (both in the lerta and pole sense, i.e. both permitted and able), that would be great indeed.

+David Giraudeau 
The dating is a good point I hadn't considered. I think the best course of action would be to note which text a word appears in, and then give a list of texts with dates separately.

Tamas Ferencz Aug 14, 2015 (11:48)

+Roman Rausch
I agree, EldarinWiki was too ambitious, especially as I wanted it to be all-encompassing, involving all languages. It would require an active community with the strength of hundreds or thousands of devoted users to ensure consistent growth. That is a dream - looking at the number of members who are active in this community, for instance: it's hardly 5% of all members, and this one is considered an active community these days.
But I agree that an initial data import into a wiki-like structure, from whatever source is technically feasible, would be a great kickstart.

Lúthien Merilin Aug 14, 2015 (13:59)

+David Giraudeau +Tamas Ferencz +Francesco Veneziano +Paul Strack thank you for your comments! I am indeed as yet unfamiliar with eldamo.org, EldarinWiki and Parf Edhellen, so there's a lot to look into :)

It is indeed as Roman says: quite a few people have put much effort into creating an 'elvish dictionary' of some form or another, but it turns out to be very difficult to keep it in working condition, so to speak. It requires a serious and sustained effort that may indeed be too much for one individual to muster, despite the obvious enthusiasm, knowledge and skills that go into it; after all, we all have other obligations and a life to live.

That's why a construction that enables the community itself to keep it updated (or even to extend it) might work better.

What I would like to do is to decouple the data source from the interface: to create a 'repository' containing all the words, translations, annotations, examples and everything else that we want. This should be set up in such a way that it is possible to add features and data without changing the structure.
This repository could then be used by either a website, a desktop application or a mobile app.

Re. the content: as Roman also points out, quite up-to-date word lists have already been compiled. Of course they won't contain everything that we would ideally like to see, but that need not be a problem at all, because content can always be added later.

The first thing to do would be to work out a data model that allows all this; then we could import the existing content in there and create a first 'interface' to use that content. Whether that would be a website or an application does not make much difference; eventually they could all co-exist using the same data.

I'll have a look at the data models that I have and whether or not they can represent everything that's required, or would need some more work to achieve that: one is a relational version of the TEI schema as used by Dragon Flame / Hesperides / Hisweloke; the other I created to accommodate the Sindarin list compiled by Roman. The latter is a lot simpler (and 'normalized' in database-design terms) and might be an interesting starting point.

Paul Strack Aug 14, 2015 (15:15)

Separating the data from the application is definitely the right way to go. In terms of organizing the data, I would suggest focusing on three types of entities.

1) Attestation - a specific attested form appearing in the source material. This form should be recorded exactly as it appears in the source, without any kind of normalization. The attestation should be recorded as objectively as possible, so that any two members of the community could easily agree on its correctness.

2) Word - a dictionary entry, grouping a set of related attestations. For example, you might group attestations of têw and tîw as singular and plural forms of the Sindarin word têw "letter". This is necessarily a subjective process, so you need to figure out how to handle differences of opinion. For example, should N. toltho "fetch" be normalized to S. toltha- or S. tolla-, to reflect Tolkien's revisions of phonology between Noldorin and Sindarin? To support different community opinions, you may need to allow an attestation to be assigned to multiple words.

3) Relationship (either between attestations or words). You probably want to be able to handle a variety of relationships. Possibilities include:

a) Inflection: tîw is the plural of têw.

b) Derivation: têw is derived from the root TEK.

c) Cognate: Q. tengwa is the cognate of S. têw.

d) Revision: Tolkien changed the form tolthui to tollui.

I think if you built a system that could handle those 3 types of entities it would be robust enough to evolve gracefully.
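
By way of a rough sketch (all table and column names here are just one possible reading of the above, using Python's sqlite3 module):

import sqlite3

# Sketch only: the three entity types as SQLite tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attestation (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL,    -- exactly as it appears in the source
    lang TEXT,             -- e.g. 'S', 'N', 'Q'
    ref  TEXT              -- source reference, e.g. 'Ety/362'
);
CREATE TABLE word (
    id    INTEGER PRIMARY KEY,
    lemma TEXT NOT NULL,   -- normalized headword, e.g. 'têw'
    gloss TEXT             -- e.g. 'letter'
);
-- many-to-many, so one attestation can be assigned to several words
-- when the community disagrees about normalization
CREATE TABLE word_attestation (
    word_id        INTEGER REFERENCES word(id),
    attestation_id INTEGER REFERENCES attestation(id)
);
-- one generic table covers inflection, derivation, cognate, revision
CREATE TABLE relationship (
    source_id INTEGER,
    target_id INTEGER,
    type      TEXT         -- 'inflection', 'derivation', 'cognate', 'revision'
);
""")
conn.execute("INSERT INTO word VALUES (1, 'têw', 'letter')")
conn.executemany("INSERT INTO attestation VALUES (?, ?, ?, ?)",
                 [(1, 'têw', 'S', None), (2, 'tîw', 'S', None)])
conn.executemany("INSERT INTO word_attestation VALUES (?, ?)", [(1, 1), (1, 2)])
conn.execute("INSERT INTO relationship VALUES (2, 1, 'inflection')")  # tîw is the plural of têw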

Tamas Ferencz Aug 14, 2015 (15:41)

Perhaps the central database should be just that: a database? Sqlite? MySQL? With a standardized field structure, and forms to enter the data?
I would also very much like to see some sort of semantic categorization (in EldarinWiki we tried to use Buck's semantic categories). In that way, if I wanted to find out what etymological process Tolkien used to create words related to, for instance, tools, I could easily get a listing of all words in all languages in that semantic category.

Lúthien Merilin Aug 14, 2015 (15:45)

Here are the diagrams of the Hisweloke database, created from the XML TEI schema: https://goo.gl/photos/bCmaqxUj2QdM7WSe9

Lúthien Merilin Aug 14, 2015 (15:59)

And this is the new schema, based on the Sindarin word list by Roman.

https://goo.gl/photos/oTxvmeRoaU2vNc1D6

This will presumably need to be extended to accommodate (for instance) grammatical information and examples, but I believe it would not need to be nearly as elaborate as the other schema. Also, using abstracted / generic entities (e.g. 'Dataclass' and 'Metadata') allows for much more flexibility, as the data structure is not so much defined in the table structure but (partly) in the contents of the tables.
By way of example: say you'd have a table with Quenya entries and one with Sindarin ones. Adding Telerin would then require a change to the data model. But if you have a generic table with entries, and one with languages, so that every entry has a reference to the language table, adding support for another language is simply a matter of adding a row to the Languages table.

Carrying this principle a bit further could result in much more flexibility and a simpler database, admittedly at the cost of making the schema more abstract and thus harder to read at first glance (and possibly a bit slower in performance). But in this case that shouldn't be a problem.
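
A quick sketch of what I mean, in Python with SQLite (illustrative names only):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (id INTEGER PRIMARY KEY, code TEXT UNIQUE, name TEXT);
CREATE TABLE entry (
    id      INTEGER PRIMARY KEY,
    lang_id INTEGER REFERENCES language(id),
    form    TEXT,
    gloss   TEXT
);
""")
conn.executemany("INSERT INTO language (code, name) VALUES (?, ?)",
                 [("Q", "Quenya"), ("S", "Sindarin")])
# Adding Telerin later is one row, not a schema change:
conn.execute("INSERT INTO language (code, name) VALUES ('T', 'Telerin')")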

Lúthien Merilin Aug 14, 2015 (16:00)

+Tamas Ferencz  yes, exactly!
SQLite should in fact be perfect, given its minimal overhead. It also performs very well: last year at work I used it for quite a large dataset taken from Wikidata (all metadata and summaries from the English Wikipedia), and with the right indexing even that performed very fast. What we have is only a fraction of that amount, so it should be no problem at all.

Leonard W. Aug 14, 2015 (16:04)

I'll try to get back to you with Parf Edhellen's structure soon! It's MySQL and has much of what you've discussed already.

Paul Strack Aug 14, 2015 (16:46)

Another factor to consider is data import/export. To be manageable, the system should be able to handle bulk imports, preferably from something like a spreadsheet that non-technical users can handle easily.

For example, when PE23 is published with a thousand or so new attested forms, it should be possible to enter the data into a spreadsheet and import that rather than having to open and edit a thousand separate pages in a data entry UI.

You are going to need to build an import feature anyway to load data from existing sources, so making it support future bulk-import needs as well would be very helpful. Ideally, you should be able to do bulk exports too (in the same format) to facilitate offline review and editing.

Lúthien Merilin Aug 22, 2015 (12:36)

A heads-up: after some discussion with Roman, I think it's best to first work on the data model, so that it can accommodate everything that is needed. I'll probably have questions later about what "everything that is needed" should consist of and how entities interrelate, and I'll post them here!

I also did some research into what kind of platform to use. For now it looks as if SQLite is a good candidate for the database, and Python seems a good candidate for the programming language, because it is widely used and supported.
That still left the question of what to use to build the GUI (the thing that you actually interact with), as there are many different options to choose from in Python. I tried out Qt (& PyQt) last week but found it a rather hefty thing to install and set up (plus I'm not completely certain that the Qt usage license is what we want). So I looked further and found Kivy (http://kivy.org): a much more lightweight framework that promises full compatibility with mobile devices. I'll test it and see if it fulfils the needs of this case.

Tamas Ferencz Aug 22, 2015 (13:48)

+Lúthien Merilin one upvote for Python. As for the front-end: how about cross-platform wx (via wxPython); or just make a strong backend, and then people can create various frontends depending on the platform and OS they work on.

Lúthien Merilin Aug 22, 2015 (18:49)

Ah, I'll have a look at WX as well!

Re. "Just create a strong back-end": the backend is indeed the most important, but I don't want to let the opportunity pass to create a GUI - I rarely get the chance at work, and I just like doing that for a change!

That doesn't preclude others from creating front-ends as well; I'll make sure to create a solid API on the back-end.

Paul Strack Aug 22, 2015 (21:53)

For defining the data model, I think you can achieve a lot by setting up generic data structures. For example, instead of creating a separate table for inflections, derivations, etc., you can define a generic "relationship" table to cover all kinds of relationships.

RELATIONSHIP

SOURCE_ID
TARGET_ID
TYPE
QUALIFIER
COMMENT

Here TYPE can be "inflection", "derivation", "revision", "cognate", "element" and other types as needed. The QUALIFIER field can be used for relationships like inflections that have a subtype (past, soft-mutation, etc.).

I recommend an optional COMMENT field because sometimes a relationship may need additional information.
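
Spelled out as SQLite DDL, that might look like this (a sketch; the example IDs are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE relationship (
    source_id INTEGER NOT NULL,
    target_id INTEGER NOT NULL,
    type      TEXT NOT NULL,  -- 'inflection', 'derivation', 'revision', 'cognate', 'element'
    qualifier TEXT,           -- subtype, e.g. 'past', 'soft-mutation'
    comment   TEXT            -- optional additional information
)""")
# Hypothetical IDs: Tolkien revised tolthui (101) to tollui (102)
conn.execute("INSERT INTO relationship VALUES (101, 102, 'revision', NULL, NULL)")
# An inflection with a subtype goes in the qualifier field:
conn.execute("INSERT INTO relationship VALUES (2, 1, 'inflection', 'plural', 'tîw is the plural of têw')")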

Lúthien Merilin Aug 23, 2015 (09:20)

+Paul Strack, thanks yes, I was thinking in that direction as well!

Lúthien Merilin Aug 23, 2015 (22:57)

it seems I missed some comments:

+Leonard W. - re. 'Parf Edhellen's data model' - thanks, that'd be great!

+Paul Strack - re. bulk import / export: that's a good point. Maybe something like the data import feature that Excel offers (1: choose format, 2: choose field delimiter, 3: map columns)? I'd think that it would still be quite error-prone ... depending of course on the complexity of the data to be imported. This is one to thoroughly think over! Maybe it's an idea to put some strict validations on the format of the data to be imported (limiting the number of allowed formats, specifying required columns, etcetera)?

It's also entirely possible (even likely) that someone has already written CSV import functionality in Python, which could presumably be added as a library without too much effort.

We should also keep in mind, though, that the effort of developing a solid import feature should not grow much beyond the effort of importing a batch of data such as PE23 - how often does something like that happen? If that's a once-in-five-years occasion, it could also be an idea to ask someone with sufficient regex skills to rework the data into a SQL import file :)

Lúthien Merilin Aug 23, 2015 (23:13)

+Tamas Ferencz  (missed this comment as well)
 
re. "Perhaps the central database should be just that: a database? Sqlite? MySQL? With a standardized field structure, and forms to enter the data?"

The database doesn't need to be central. In fact, this is what stalled our previous attempt, because a central database implies one or more 'guardians' to maintain it, resulting in difficult-to-answer questions such as "who's going to do that?", etcetera :)

If we distribute the database across all clients, maintenance becomes a shared responsibility across all clients / users, with a synchronisation feature that pushes edits to any database to all others. Or maybe not "all", if everyone feels that's too risky - though we could also include a simple roll-back mechanism to counter possible vandalism.
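
As a very rough sketch of such a synchronisation (everything here is hypothetical - in reality you'd exchange the log over the network rather than in memory): each client keeps an append-only log of edits, and syncing means pulling in the entries you haven't seen yet; a roll-back would then just be replaying the log without the offending entries.

import json, time

# Hypothetical sketch: each client keeps an append-only edit log;
# synchronising means taking over the other side's unseen entries.
class Client:
    def __init__(self, name):
        self.name = name
        self.log = []                  # [(timestamp, author, edit-as-json), ...]

    def edit(self, edit):
        self.log.append((time.time(), self.name, json.dumps(edit)))

    def sync_from(self, other):
        known = set(self.log)
        for entry in other.log:
            if entry not in known:
                self.log.append(entry)
        self.log.sort()                # replay in timestamp order

a, b = Client("client-a"), Client("client-b")
a.edit({"op": "add", "form": "têw", "gloss": "letter"})
b.sync_from(a)                         # b now has a's edit
print(len(b.log))                      # 1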

Re. "standardized field structure" .. not sure what you mean with "standardized"? In any case one that will facilitate the community's needs.

Re. forms to enter the data - yes, certainly. I suppose it makes most sense to include that in the dictionary app, via a menu item "edit contents" or something like that.

Leonard W. Aug 23, 2015 (23:16)

I'm sorry guys for not getting back to you with a DB model. My schedule has been crazy the last couple of days, but I'll try to get back to you next week.

As for a central database, perhaps a Github account would work?

Lúthien Merilin Aug 23, 2015 (23:19)

+Leonard W. no hurry, I've not actually started on the data model yet: I'm first trying to settle on the programming language and GUI framework, trying things out, etcetera.

Github is fine as the repository!

Tamas Ferencz Aug 24, 2015 (08:22)

Indeed Python has a built-in csv module
https://docs.python.org/2/library/csv.html
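
For instance, a minimal bulk import using that module could look like this (a sketch: the table layout and required columns are illustrative, and the file name is hypothetical):

import csv, sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attestation (lang TEXT, gloss TEXT, trans TEXT, ref TEXT)")

def import_csv(path):
    # Validate the format strictly before importing anything.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        required = {"lang", "gloss", "trans", "ref"}
        missing = required - set(reader.fieldnames or [])
        if missing:
            raise ValueError("missing columns: %s" % ", ".join(sorted(missing)))
        rows = [(r["lang"], r["gloss"], r["trans"], r["ref"]) for r in reader]
    conn.executemany("INSERT INTO attestation VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# import_csv("pe23_words.csv")  # hypothetical spreadsheet export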

Leonard W. Aug 25, 2015 (09:54)

JSON is probably a better choice: it's more self-explanatory than CSV, and yet not as verbose as XML.
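
For instance, a single entry as JSON (field names are illustrative):

import json

record = {"lang": "N", "gloss": "calf", "trans": "water-vessel", "ref": "Ety/362"}
print(json.dumps(record, ensure_ascii=False, indent=2))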

Lúthien Merilin Feb 15, 2016 (20:40)

Mellyn,

I was discussing the requirements for the database with Roman, but I haven't been able to reach him for a while.
I've also been quite busy at work for some time, but want to pick things up before it grinds to a complete halt.

I'll try to give a summary of where we left off; that will also help me refresh my memory.

Paul Strack came up with the excellent idea to separate the data from the application. He also suggested considering these three types of entities:

1) Attestation - a specific attested form appearing in the source material. This form should be recorded exactly as it appears in the source, without any kind of normalization. The attestation should be recorded as objectively as possible, so that any two members of the community could easily agree on its correctness.

2) Word - a dictionary entry, grouping a set of related attestations. For example, you might group attestations of têw and tîw as singular and plural forms of the Sindarin word têw "letter".
This is necessarily a subjective process, so you need to figure out how to handle differences of opinion. For example, should N. toltho "fetch" be normalized to S. toltha- or S. tolla-, to reflect Tolkien's revisions of phonology between Noldorin and Sindarin? To support different community opinions, you may need to allow an attestation to be assigned to multiple words.

3) Relationship (either between attestations or words).
You probably want to be able to handle a variety of relationships. Possibilities include:
a) Inflection: tîw is the plural of têw.
b) Derivation: têw is derived from the root TEK.
c) Cognate: Q. tengwa is the cognate of S. têw.
d) Revision: Tolkien changed the form tolthui to tollui.


Roman suggested encoding every attestation of a word as it appears in the published material, without any added opinions. As a bare minimum one would have something like this:

- attestation table -
entry id:
lang : the language - eg. N [for Noldorin]
gloss : the word as it appears in a source - eg. calf
trans : Tolkien’s exact translation - eg. water-vessel
ref : the reference to the source - eg. Ety/362
deleted : was it deleted by Tolkien or allowed to stand - e.g. false

with the possibility of adding more columns, like:
source : the name of the text a word appears in - e.g. the essay 'Quendi and Eldar'
root : the proto-form as given by Tolkien - eg. KALAP

- translation table, containing translations to other modern languages -
entry id:
DE: Wassergefäß
FR: Récipient d'eau

Tables for normalizations:

- normalization table X/PH -
entry id:
-> calph

- normalization table X/LL -
entry id:
-> tollui

- normalization table X/LH -
entry id:
-> tolhui


Etymological information, relevant for the individual languages.
For Sindarin, special case mutations; for Quenya, the th/s etymologies:

- special case mutation -
entry id:
scm: true / false

- etymological info: Quenya-th -
entry id:
th_spelling: thúre [for súre 'wind' - I don't think it can be just a boolean here, since a word can have several s's]

A correction table for misprints, e.g. thann was misread as thenin by Christopher Tolkien, but corrected in VT/46:16:

- corrections table -
entry id:
corrected: thann
ref: VT/46:16

Several relationship tables:

- plural table -
singular entry id:
plural entry id:

- past tense table -
verb entry id:
past tense entry id:

- variants table -
variant1 entry id: [the first form, e.g. gweith out of gweith, gwaith]
variant2 entry id: [the second form, e.g. gwaith out of gweith, gwaith]
variant3: [...]

- revisions table -
changed_from entry id:
changed_to entry id:

- corpus texts table -
attestation id:
corpus_text text_id:

Roman suggested we postpone relationships like cognates and compound derivations for the time being.
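
Spelled out in SQLite (a sketch: the column types are assumptions, the sample row comes from the list above), the bare-minimum attestation table could look like this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE attestation (
    entry_id INTEGER PRIMARY KEY,
    lang     TEXT,     -- e.g. 'N' for Noldorin
    gloss    TEXT,     -- the word as it appears in a source, e.g. 'calf'
    trans    TEXT,     -- Tolkien's exact translation, e.g. 'water-vessel'
    ref      TEXT,     -- reference to the source, e.g. 'Ety/362'
    deleted  INTEGER   -- 0/1: deleted by Tolkien or allowed to stand
)""")
conn.execute("INSERT INTO attestation VALUES (1, 'N', 'calf', 'water-vessel', 'Ety/362', 0)")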

If anyone has comments or suggestions I'd be more than happy to hear them. My next goal is to work this out in a more formal database model.

Thanks,
Lúthien

Paul Strack Feb 15, 2016 (21:23)

You can save yourself a lot of work by making your relationships generic. Rather than having separate tables for plural, past, etc., have a generic inflection table that indicates the nature of the inflection with a type field:

INFLECTION
==========
BASE_ENTRY_ID
INFLECTED_ENTRY_ID
TYPE

Where an entry has multiple simultaneous inflections, you can do a space-delimited list:

"past plural"

This will also make your data entry easier by cutting down on the number of relationships you need to track and enter.
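
A sketch of that table in SQLite, including a query that still finds a single type inside a space-delimited list (the padding with spaces prevents false matches inside longer type names):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE inflection (
    base_entry_id      INTEGER,
    inflected_entry_id INTEGER,
    type               TEXT   -- space-delimited, e.g. 'past plural'
)""")
conn.execute("INSERT INTO inflection VALUES (1, 2, 'past plural')")

# Find all inflections whose type list contains 'past':
rows = conn.execute(
    "SELECT * FROM inflection WHERE ' ' || type || ' ' LIKE '% past %'"
).fetchall()
print(rows)  # [(1, 2, 'past plural')]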

Lúthien Merilin Feb 16, 2016 (01:19)

Thank you, I will. Actually, my previous design was like that as well.

Which reminds me of a colleague I had when I lived in Victoria, BC: a veteran from the Alberta oil fields who looked like a bearded pirate (only the eye patch and the wooden leg were missing). He was always telling stories about how he had coded something functionally similar to whatever was 'en vogue' at the moment (such as messaging) - but using 1/1000th of the memory we needed now, on a PDP-11 or whatever he was using around 1978 :)

I suppose that background was the reason he always urged me to push the design further in the direction of a completely abstract entity-attribute-value model, until reading it became almost like solving a cryptogram. Which was actually a heaven-sent side effect, since it stopped our horrible micro-managing manager from having me make trivial changes just to throw his weight around (e.g.: "can you make that column varchar(15) instead of INT ... oh, and also change the name from hmmm ... 'feed_voltage' to 'required_electrical_specified_input_value'").

But even apart from that, it was often a good move, making the database a lot simpler and more flexible: if some new data type needed to be added you only had to add a row and not alter a table definition to add a column (or worse, add a whole table).
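
A toy version of that entity-attribute-value idea (names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT);
CREATE TABLE attribute (entity_id INTEGER, name TEXT, value TEXT);
""")
conn.execute("INSERT INTO entity VALUES (1, 'word')")
conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)", [
    (1, "form", "têw"),
    (1, "lang", "S"),
    (1, "gloss", "letter"),
    # a brand-new property later on is just another row, not an ALTER TABLE:
    (1, "root", "TEK"),
])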

I've seen many DBs since then that were as sparsely populated as an Emmental cheese and that could have been squeezed into far fewer tables with far fewer columns. Of course there's a bit of a performance penalty, but in our case that's completely irrelevant.