Multi-Lingual AtD

Now you can use After the Deadline for Dutch, French, German, Indonesian, Italian, Polish, Portuguese, Spanish, or Russian.

Included in this package are contextual spell checking models trained on Wikipedia text using dictionaries from OpenOffice. Grammar checking is available for some languages courtesy of the LanguageTool.org project.

Please grab this archive only if you need it. It’s over 500MB.

Installation

Install the After the Deadline source code distribution, then extract this archive over it. Note that the archive contains a top-level atd/ folder, so extract it from the directory that holds your existing atd/ folder.

tar zxvf atd_lang20140430.tgz

Running Multi-Lingual AtD

To run a multi-lingual AtD server, set the -Datd.lang= property when starting After the Deadline. This tells AtD to load the lang/[atd.lang value]/load.sl file, which redefines how AtD parses sentences and loads resources for spell checking.
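As a rough illustration of that lookup, the sketch below maps a language id to the load.sl path AtD would read and reports a missing file before you start the server. The function name atd_load_script is hypothetical and not part of the AtD distribution, and the check assumes it is run from the top-level atd directory.

```shell
# Hypothetical helper (not shipped with AtD): print the load.sl path that
# -Datd.lang=<id> points AtD at, and fail if that file does not exist.
# Assumes the current directory is the top-level atd directory.
atd_load_script() {
    atd_lang_id="$1"
    atd_script="lang/${atd_lang_id}/load.sl"
    if [ ! -f "$atd_script" ]; then
        echo "no load.sl for language '${atd_lang_id}' (expected ${atd_script})" >&2
        return 1
    fi
    echo "$atd_script"
}
```

Calling atd_load_script es before launching a Spanish server is a cheap way to confirm the language pack was extracted into the right place.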

The AtD distribution archive includes a script to run an AtD service for each language, e.g.:

./bin/spanish.sh

After the Deadline software does not store any state. You may run multiple instances of After the Deadline from the same directory.

Be aware that the low-memory mode does not apply to non-English AtD. If there is enough demand, it’s technically feasible to add it.

Rebuilding the Language Models

From the top-level atd directory you can execute build.sh in any language directory to rebuild the language models. You’ll need to populate the corpus directory of the language with data from Wikipedia. I provide instructions on how to generate this data on the AtD blog.

Here is an example build.sh file for Indonesian:

export TARGET=id
export THRESHOLD=10

source lang/lib/makedict.sh
source lang/lib/create.sh

The TARGET value is the language id, two characters by convention. The folder inside the lang folder must have the same name. The THRESHOLD value is the number of times a word must be seen in Wikipedia before it’s added to the dictionary. Use this if you want to augment the existing dictionary with words from Wikipedia.
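To get a feel for what a given THRESHOLD would let through, you can count tokens in a corpus file yourself. The sketch below is an illustration of the cutoff idea only; it is not what lang/lib/makedict.sh does internally, the function name is made up, and it splits on whitespace alone, so punctuation stays attached to words.

```shell
# Illustration only (not the makedict.sh logic): list whitespace-separated
# tokens that occur at least <threshold> times in a plain-text corpus file.
words_over_threshold() {
    wot_corpus="$1"      # path to a plain-text corpus file
    wot_threshold="$2"   # minimum number of occurrences, like THRESHOLD
    tr -s '[:space:]' '\n' < "$wot_corpus" |
        sort | uniq -c |
        awk -v t="$wot_threshold" 'NF == 2 && $1 >= t { print $2 }'
}
```

Running it with a few different thresholds on a sample of your corpus shows how quickly the candidate word list shrinks as the value goes up.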

How to Add Another Language

  1. Make a new folder for your language in the lang/ directory.
  2. Copy atdconfig.sl and build.sh from another lang directory and modify them to reflect your language id.
  3. Generate a plain-text corpus for your language and place it in lang/[id]/corpus.
  4. Place any dictionary files you have in lang/[id]/wordlists. No wordlists? No problem. Generate one from your plain-text corpus (build.sh can do this, just set a very low threshold value).
    • Tip: Use unmunch (included with Hunspell) to create a flat word-list from a .dic and .aff file.
  5. If LanguageTool supports your language, copy load.sl from the lang/fr directory and modify it to match your language id.
  6. If LanguageTool does not support your language, copy load.sl from the lang/pt directory and modify it to match your language id. This option will give you contextual spell checking only.
  7. Run ./lang/[id]/build.sh to build your new language models.
  8. Use ./bin/french.sh as a template to create a start script for your language.
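The copying in steps 1, 2, 5, and 6 can be sketched as a small shell function. This is a hypothetical helper, not something shipped with AtD. It assumes it is run from the top-level atd directory and that build.sh sets the id with an "export TARGET=" line, as in the Indonesian example. It only rewrites that line; atdconfig.sl and load.sl still need to be edited by hand.

```shell
# Hypothetical helper (not shipped with AtD): create lang/<new_id> by copying
# atdconfig.sl, build.sh, and load.sl from an existing language directory.
# Run from the top-level atd directory.
scaffold_lang() {
    sl_new_id="$1"       # id of the language to add
    sl_template_id="$2"  # existing language to copy from, e.g. fr or pt

    if [ ! -d "lang/${sl_template_id}" ]; then
        echo "template lang/${sl_template_id} not found" >&2
        return 1
    fi
    if [ -e "lang/${sl_new_id}" ]; then
        echo "lang/${sl_new_id} already exists" >&2
        return 1
    fi

    # Step 1, plus the corpus and wordlists folders used in steps 3 and 4.
    mkdir -p "lang/${sl_new_id}/corpus" "lang/${sl_new_id}/wordlists"

    # Steps 2, 5, 6: copy the files that must be adapted to the new id.
    for sl_file in atdconfig.sl build.sh load.sl; do
        cp "lang/${sl_template_id}/${sl_file}" "lang/${sl_new_id}/${sl_file}" || return 1
    done

    # Point build.sh at the new id (assumes an "export TARGET=" line).
    sed "s/^export TARGET=.*/export TARGET=${sl_new_id}/" \
        "lang/${sl_template_id}/build.sh" > "lang/${sl_new_id}/build.sh"
}
```

For example, scaffold_lang sv fr would start a Swedish directory from the French one (pick fr as the template if LanguageTool supports your language, pt otherwise). The function refuses to overwrite an existing language directory, so it is safe to rerun by mistake.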

Wanted: Misused Word Detection

Missing from the language pack is AtD’s statistical misused word detection. This is a trainable feature. We simply need a text file in which each line is a comma-separated list of words to disambiguate. From this information we can create a real-word error detection and correction tool for you. If you have one of these, contact us and we’ll hack this feature into the language pack.
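To make the expected input concrete: a Spanish file might contain lines such as "hecho,echo" or "vaya,valla,baya" (illustrative examples only). The sketch below checks a file for that shape before you send it. The function name is hypothetical, and the rule it enforces (at least two non-empty words per non-blank line) is an assumption drawn from the description above, not a documented AtD format check.

```shell
# Hypothetical check (not part of AtD): succeed only if every non-blank line
# is a comma-separated list of at least two non-empty words. Offending lines
# are printed with their line numbers.
check_confusion_sets() {
    awk -F',' '
        NF == 0 { next }        # skip completely empty lines
        NF < 2  { bad = 1 }     # a single word has nothing to disambiguate
        { for (i = 1; i <= NF; i++) if ($i ~ /^[ \t]*$/) bad = 1 }
        bad     { printf "line %d: %s\n", NR, $0; bad = 0; failed = 1 }
        END     { exit failed + 0 }
    ' "$1"
}
```

A zero exit status means every line looks usable; otherwise the printed line numbers show where to fix the file.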