Skip to content

TinT

Requirements#

Requires the ldc-tint library.

Plugins#

Detect Māori language#

The detect-maori can be used for filtering out records that don't have enough Māori characters or too many non-Māori ones by applying user-supplied threshold. Below, a maximum on 10% of non-Māori characters are tolerated before the record gets discarded:

ldc-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    -i "./input/alpaca_data_cleaned-mi.json" \
  detect-maori \
    -l INFO \
    -M 0.1

An alternative filter is-maori uses the reo-toolkit to determine whether text is Māori or not. Below, a minimum of 70% words must be Māori:

ldc-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    -i "./input/alpaca_data_cleaned-mi.json"
  is-maori \
    -l INFO \
    -m 0.7  

Handling macrons#

You can use the de-macronize filter to remove/replace macrons:

ldc-convert \
  -l INFO \
  from-txt-pt \
    -l INFO \
    -i "./input/māori.txt" \
  de-macronize \
    -d strip \
    -l INFO \
  to-txt-pt \
    -l INFO \
    -o "./output/maori_stripped.txt"