IScIDE 2012
Nanjing
Autoencoder for Polysemous Word
Wei-Chen Cheng Jiun-Wei Liou
Daw-Ran Liou
Cheng-Yuan Liou *
Dept. of Computer Science and Information Engineering, National Taiwan University
Introduction & review
Generating a code for each word
Cheng-Yuan Liou, Jau-Chi Huang, Wen-Chie Yang: Modeling word perception using the Elman network. Neurocomputing 71 (2008) 3150–3157
Generating attributes for each word by linguistics experts
        attribute 1 = water,
Bank =  attribute 2 = earth,
        …,
        attribute R = …
Manually assigned attributes are costly and controversial.
Generating them automatically!
Predicting next word’s attributes
9/28/2012 6
Figure: Illustration of Elman network.
Renewed attributes
• The network’s weights are updated after each word is presented, to reduce the prediction error.
• After each training pass, the averaged prediction for each word is used as that word’s renewed attributes.
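A minimal sketch of the training pass described above, assuming a NumPy implementation of a simple recurrent (Elman-style) network. The layer sizes, learning rate, tanh activations, toy corpus, and the names train_pass, W_in, W_ctx, and W_out are illustrative assumptions, not settings from the talk or the 2008 paper. Weights are updated after every word, and the predictions made for each word are averaged to give its renewed attributes at the end of the pass.

```python
import numpy as np

def train_pass(corpus, codes, params, lr=0.05):
    """One pass over the corpus; returns renewed codes and the mean squared error."""
    W_in, W_ctx, W_out = params            # arrays are updated in place
    context = np.zeros(W_ctx.shape[0])     # copy of the previous hidden state
    sums = np.zeros_like(codes)
    counts = np.zeros(len(codes))
    total_err = 0.0
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        x, target = codes[cur], codes[nxt]
        hidden = np.tanh(W_in @ x + W_ctx @ context)
        pred = np.tanh(W_out @ hidden)     # predicted attributes of the next word
        err = pred - target
        total_err += float(err @ err)
        # Gradient of 0.5 * ||err||^2, treating the context as a fixed input
        d_out = err * (1.0 - pred ** 2)
        d_hid = (W_out.T @ d_out) * (1.0 - hidden ** 2)
        W_out -= lr * np.outer(d_out, hidden)
        W_in -= lr * np.outer(d_hid, x)
        W_ctx -= lr * np.outer(d_hid, context)
        # Accumulate the prediction under the word that actually came next
        sums[nxt] += pred
        counts[nxt] += 1
        context = hidden
    renewed = codes.copy()
    seen = counts > 0                      # words never predicted keep their old code
    renewed[seen] = sums[seen] / counts[seen][:, None]
    return renewed, total_err / (len(corpus) - 1)

rng = np.random.default_rng(0)
corpus = [0, 1, 2, 0, 1, 3, 0, 1, 2, 0, 1, 3]   # hypothetical word indices
R, H, V = 4, 6, 4                                # attributes, hidden units, vocabulary
codes = rng.uniform(-1, 1, size=(V, R))          # random initial attributes
params = [rng.normal(0, 0.3, size=s) for s in [(H, R), (H, H), (R, H)]]
for _ in range(20):
    codes, mse = train_pass(corpus, codes, params)
```

The slides do not say whether the codes are rescaled between passes; this sketch leaves them as plain averages.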
Generated attributes
• Semantic categorization
• Indexing
• Ranking
• Stylistic analysis
Categorization of Shakespeare’s 36 plays
Legend: c = comedy, r = romance, h = history, t = tragedy. The number denotes the publication year.
Indexing result without keywords
http://red.csie.ntu.edu.tw/demo/literal/SAS.htm
Query: she loves kiss
Search result (Shakespeare plays):
BENVOLIO: Tut, you saw her fair, none else being by, herself poised with herself in either eye; but in that crystal scales let there be weigh’d your lady’s love against some other maid that I will show you shining at this feast, and she shall scant show well that now shows best.
- Romeo and Juliet
Query: armies die in blood
Search result (Shakespeare plays):
MARCUS ANDRONICUS: Which of your hands hath not defended Rome, and rear’d aloft the bloody battle-axe, writing destruction on the enemy’s castle? O, none of both but are of high desert: my hand hath been but idle; let it serve to ransom my two nephews from their death; then have I kept it to a worthy end.
- Titus Andronicus
Ranking Shakespeare’s 36 plays
Authorship
Stylistic analysis
Table: ‘RSMD’ values of William Shakespeare’s plays
Table: ‘RSMD’ values of William Shakespeare’s works
Polysemous word
• Difficulty of concept
• A many-to-one mapping is a function; a one-to-many mapping (one word, several meanings) is not.
Polysemous word
Building a meaning pool matrix, M, for each word.
M contains B meanings (B candidates) in its column vectors.
B = 2 for the polysemous word ‘bank’:
          money          river
          attribute 1,   attribute 1,
Bank =    attribute 2,   attribute 2,
          …,             …,
          attribute R    attribute R
Predicting next word’s meaning
The code of the best-predicted meaning in ‘M’ is used as the code for the next input word.
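A minimal sketch of this selection step, assuming that "best predicted" means the column of M with the smallest Euclidean distance to the network's prediction; the slides do not state the matching criterion. The function name best_meaning and all numeric values are hypothetical.

```python
import numpy as np

def best_meaning(M, prediction):
    """Return (index, code) of the column of M closest to the prediction."""
    # M has shape (R, B): R attributes, B candidate meanings in its columns
    distances = np.linalg.norm(M - prediction[:, None], axis=0)
    s = int(np.argmin(distances))
    return s, M[:, s]

# Hypothetical pool for 'bank': R = 3 attributes, B = 2 candidate meanings
M_bank = np.array([[0.9, -0.8],
                   [0.1,  0.7],
                   [-0.5, 0.2]])
prediction = np.array([-0.6, 0.5, 0.1])   # hypothetical network output
s, code = best_meaning(M_bank, prediction)
# 'code' would then be fed to the network as the input for this word
```

Other criteria, such as the largest inner product, would fit the same interface.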
Renewed attributes
• The network’s weights are updated after each word is presented, to reduce the prediction error.
• After each training pass, the averaged prediction for a specific meaning of the next word is used as the renewed attributes of that meaning.
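A minimal sketch of the per-meaning renewal, assuming each prediction is assigned to its nearest candidate meaning (Euclidean distance, an assumed rule) and that a meaning which receives no predictions keeps its old code. The name renew_pool and the toy values are hypothetical.

```python
import numpy as np

def renew_pool(M, predictions):
    """Renew each column of M with the mean of the predictions assigned to it."""
    # M: (R, B) meaning pool; predictions: (N, R) outputs collected over one pass
    dists = np.linalg.norm(predictions[:, :, None] - M[None, :, :], axis=1)
    chosen = np.argmin(dists, axis=1)      # nearest meaning for each prediction
    M_new = M.copy()
    for s in range(M.shape[1]):
        mask = chosen == s
        if mask.any():                     # unused meanings keep their old code
            M_new[:, s] = predictions[mask].mean(axis=0)
    return M_new, chosen

M = np.array([[1.0, -1.0],
              [0.0,  0.0]])                # R = 2 attributes, B = 2 meanings
predictions = np.array([[0.8, 0.2],
                        [0.6, 0.0],
                        [-0.9, 0.1]])
M_new, chosen = renew_pool(M, predictions)
```

Here the first two predictions are averaged into the first meaning and the third replaces the second meaning.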
Figure: Illustration of Elman network for multi‐code.
Experiments
Dream of the Red Chamber 紅樓夢
Romance of the Three Kingdoms 三國演義
Red Chamber has more than 841 thousand characters and uses 5069 different Chinese characters.
Pick 246 words (<5%) whose frequency fq satisfies 300 ≤ fq ≤ 1200.
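The frequency filter can be sketched as follows, assuming the text has already been split into tokens (for the Chinese novels, presumably single characters). The function name select_by_frequency and the toy text are hypothetical; only the bounds 300 and 1200 come from the slide.

```python
from collections import Counter

def select_by_frequency(tokens, low, high):
    """Return the tokens whose corpus frequency fq satisfies low <= fq <= high."""
    counts = Counter(tokens)
    return sorted(tok for tok, fq in counts.items() if low <= fq <= high)

# Toy text standing in for the novel; the slide's bounds would be low=300, high=1200.
toy_tokens = list("aaabbbbbcd")
picked = select_by_frequency(toy_tokens, low=3, high=4)
```

In the toy text only "a" (3 occurrences) falls inside the band; "b" is too frequent and "c" and "d" are too rare.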
Dream of the Red Chamber: 246 words (300 ≤ fq ≤ 1200)
Figure: Training errors using different pool sizes. Colored vertical lines mark the pass with the minimum error.
Three Kingdoms has more than 570 thousand characters and uses 5071 different Chinese characters.
Pick 258 words whose frequency fq satisfies 225 ≤ fq ≤ 525.
Romance of the Three Kingdoms: 258 words (225 ≤ fq ≤ 525)
Figure: Training errors using different pool sizes. Colored vertical lines mark the pass with the minimum error.
Table: Characters that have multiple codes. The total number of meanings is shown next to each character.
Table: Sentences in Red Chamber which contain the same character with two different meanings, s=1 and s=4.
Table: Sentences in Three Kingdoms which contain the same character with two different meanings, s=2 and s=4.
Table: Samples of two names having multiple codes.
The number of meanings
Examples
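• 話說王夫人見中秋已過,鳳姐病已比先減了,雖未大愈,然亦可出入行走得了,仍命大夫每日診脈服藥,又開了丸藥方(1)子來,配調經養榮丸。
  English: Lady Wang saw that the Mid-Autumn Festival had passed and that Fengjie’s illness had eased; though not fully recovered, she could get about again, so the doctor was still told to take her pulse and prescribe medicine daily, and he also wrote out a pill prescription for making up Tiaojing Yangrong pills. (Meaning of 方: 單子, a written slip or list)
• 說著,便袖了這石,同那道人飄然而去,竟不知投奔何方(2)何捨。
  English: So saying, he tucked the stone into his sleeve and drifted away with the Taoist, bound for no one knows what place or abode. (Meaning of 方: 方向, direction)
• 自取了筆硯紙墨出來,將方(3)才的詩,命她二人念著,遂從頭寫出來。
  English: She fetched brush, inkstone, paper, and ink herself, had the two of them recite the poems from just now, and wrote them out from the beginning. (Meaning of 方: 剛剛, just now)
• 妙玉送至門外,看她們去遠,方(4)掩門進來。
  English: Miaoyu saw them to the gate, watched until they were far off, and only then closed the gate and came in. (Meaning of 方: 才, only then)
• 賈珍等拿了藥方(5)來,回明賈母原故,將藥方放在桌上出去,不在話下。
  English: Jia Zhen and the others brought the prescription, explained the matter to Grandmother Jia, left the prescription on the table, and went out. (Meaning of 方: 帖, a medical prescription)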
The number of meanings
Examples
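• 是非成(1)敗轉頭空:青山依舊在,幾度夕陽紅。
  English: Right and wrong, success and failure, all turn to nothing in an instant; the green hills remain, while the sun has set red so many times. (Meaning of 成: 成功, success)
• 孔明曰:「曹操幼子曹植,字子建,下筆成(4)文。操嘗命作一賦,名曰銅雀臺賦。賦中之意,單道他家合為天子,誓取二喬。」
  English: Kongming said: "Cao Cao’s youngest son Cao Zhi, styled Zijian, produces a finished essay the moment he sets brush to paper. Cao once ordered him to write a rhapsody, called the Bronze Bird Terrace Rhapsody. Its message is that his house is fit to rule as Son of Heaven and that he vows to take the two Qiaos." (Meaning of 成: 形成, to form)
• 成(5)功不必添蛇足,討賊猶思奮虎威。
  English: Once victory is won there is no need to add feet to the snake; in punishing the rebels he still longs to show a tiger’s might. (Meaning of 成: 勝利, victory)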
Stylistic analysis and authorship
Dream of the Red Chamber
• Prediction error along each word:
Romance of the Three Kingdoms
• Prediction error along each word:
Summary
• Context‐based method (Changing scenario)
• Symbol–free sequence
• Meaning of a learned attribute can be calibrated by its similar words.
• Predicting the next word (symbol) of a given word sequence.
Applications
• Stylistic analysis
• Authorship
• Semantic indexing, ranking, and categorization
• Internet
• DNA, gene, or protein
• Cryptography
• Ancient language, machine translation
Figure: SARS genome, accession AY274119.3. White represents the largest error.
Influenza (1918)
• ACCESSION: AF116575.1
• (100)…GACACAGTACTCGAAAAGAATGTGACCGTGACACACTCTGTTAACCTGCTC…(150)
• (100)…112121125353115155515555532555535353535552512355222…(150)
• (500)…GGCTGACAAAGAAGGGAAGCTCATACCCAAAGCTTAGCAAGTCCTATGTGA…(550)
• (500)…152211215111551551135352532351552251135112235155555…(550)
• (1000)…GGACTAAGAAACATTCCATCTATTCAATCCAGGGGTCTATTTGGAGCCATT…(1050)
• (1000)…155351555153525321535152215223551512225252155523525…(1050)
Codes per nucleotide: A: 1, 5; T: 2, 5; C: 2, 3; G: 1, 5
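The per-nucleotide summary for the 1918 sequence can be recovered from an aligned pair of lines. A minimal sketch, using the (100)…(150) segment and its code string copied from the slide; the function name codes_per_symbol is hypothetical, and it assumes one code digit per nucleotide.

```python
from collections import defaultdict

def codes_per_symbol(sequence, codes):
    """Collect the set of meaning codes assigned to each symbol."""
    if len(sequence) != len(codes):
        raise ValueError("sequence and code strings must align one-to-one")
    table = defaultdict(set)
    for symbol, code in zip(sequence, codes):
        table[symbol].add(int(code))
    return {symbol: sorted(found) for symbol, found in sorted(table.items())}

# Positions 100 to 150 of AF116575.1 and their codes, as shown on the slide
seq = "GACACAGTACTCGAAAAGAATGTGACCGTGACACACTCTGTTAACCTGCTC"
cod = "112121125353115155515555532555535353535552512355222"
table = codes_per_symbol(seq, cod)
```

For this segment the table is A: 1, 5; C: 2, 3; G: 1, 5; T: 2, 5, which agrees with the summary line on the slide.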
Influenza (2009)
• ACCESSION: FJ966082.1
• (100)…GACACAGTACTAGAAAAGAATGTAACAGTAACACACTCTGTTAACCTTCTA…(150)
• (100)…134343153453133331335153343153343434545155334455453…(150)
• (500)…GGCTAGTTAAAAAAGGAAATTCATACCCAAAGCTCAGCAAATCCTACATTA…(550)
• (500)…114531553333331133355435344433314543143335445343553…(550)
• (1000)…GGATTGAGGAATATCCCGTCTATTCAATCTAGAGGCCTATTTGGGGCCATT…(1050)
• (1000)…113551311335354441545355433545313114453555111144355…(1050)
Codes per nucleotide: A: 1, 3; T: 3, 5; C: 4; G: 1
Detailed techniques and settings are given in the IScIDE 2012 paper.
Museum of Cao Xueqin
• Born and raised in Nanjing; lived 1715 or 1724 to 1763 or 1764
• Former residence of Cao Xueqin (曹雪芹故居), Jiangning Weaving Office (江宁织造府), Daxinggong (大行宮)
Thanks
http://www.csie.ntu.edu.tw/~cyliou/