Some tentative first steps towards a Star Trek universal communicator
Wominjeka Theatre | Sat 15 Jan 3:45 p.m.–4:30 p.m.
Presented by
Greg Baker
http://www.ifost.org.au/~gregb
Greg Baker is an entrepreneur (he's built and sold 2 businesses so far), author (6 books), translator (1 book), and an internationally awarded composer and musician. Also he codes a bit.
Abstract
We urgently need computerised translation software for the rest of the world's languages. We will probably lose around 90% of the world's languages in the next 80 years.
If you want to build a translator that can translate all the world's languages, you can't use Google Translate's approach of training on millions of documents because most of the world's languages don't even have a million words written down. You have to be much more parsimonious with your data.
I've been writing software that populates the Leaftop database, which aims to be the largest lexiconary (so far it has automatically extracted an average of 300 words from each of 1400 languages). I am also building a universal grammar extractor, which can currently inflect a plural from a singular for 11% of the world's nouns. It learned all the Latin noun declensions on its own.
This is a talk for language geeks and machine learning nerds. I'll talk about the weirdest distance metric you'll ever see (and why it is so easy to code), and I'll talk about Hiligaynon and Swahili, why Chadian Arabic was so helpful, and the trouble with Khmer. You'll see more Unicode character sets in one presentation than you'll see in an internationalisation conference.