行政院國家科學委員會專題研究計畫 成果報告
貝氏網路與分類技術之基礎研究與應用:建構學生學習歷
程之模型與語意標記(第 2 年)
研究成果報告(完整版)
計 畫 類 別 : 個別型
計 畫 編 號 : NSC 95-2221-E-004-013-MY2
執 行 期 間 : 96 年 08 月 01 日至 97 年 10 月 31 日
執 行 單 位 : 國立政治大學資訊科學系
計 畫 主 持 人 : 劉昭麟
計畫參與人員: 碩士班研究生-兼任助理人員:何君豪
碩士班研究生-兼任助理人員:鄭人豪
碩士班研究生-兼任助理人員:呂明欣
碩士班研究生-兼任助理人員:林仁祥
碩士班研究生-兼任助理人員:張智傑
碩士班研究生-兼任助理人員:賴敏華
碩士班研究生-兼任助理人員:藍家樑
報 告 附 件 : 出席國際會議研究心得報告及發表論文
處 理 方 式 : 本計畫涉及專利或其他智慧財產權,2 年後可公開查詢
中 華 民 國 98 年 01 月 26 日
行政院國家科學委員會補助專題研究計畫
█ 成 果 報 告
□期中進度報告
貝氏網路與分類技術之基礎研究與應用:
建構學生學習歷程之模型與語意標記
計畫類別:█ 個別型計畫 □ 整合型計畫
計畫編號:NSC-95-2221-004-003-MY2
執行期間:95 年 8 月 1 日 至 97 年 10 月 31 日
計畫主持人:劉昭麟
共同主持人:
計畫參與人員:碩士班研究生:何君豪、鄭人豪、呂明欣、林仁祥、
張智傑、賴敏華、藍家樑
博士班研究生:無
成果報告類型(依經費核定清單規定繳交):□精簡報告 █完整報告
本成果報告包括以下應繳交之附件:
□赴國外出差或研習心得報告一份
□赴大陸地區出差或研習心得報告一份
█出席國際學術會議心得報告及發表之論文各一份
□國際合作研究計畫國外研究報告書一份
處理方式:除產學合作研究計畫、提升產業技術及人才培育研究計畫、
列管計畫及下列情形者外,得立即公開查詢
█涉及專利或其他智慧財產權,□一年█二年後可公開查詢
執行單位:國立政治大學資訊科學系
中 華 民 國 98 年 1 月 20 日
I
中文摘要
本研究案主要著力於學生學習模型、電腦輔助試題翻譯、電腦輔助語文教學和中文訴訟文
書分類四個研究主題。在學生學習模型方面,我們以貝氏網路來表示學生學習模型,並且
提出了一個方法來學習學生的學習模型的方法。在電腦輔助試題翻譯這一項工作中,我們
建構了一個實際的系統,用以輔助專家翻譯 TIMSS 試題。在電腦輔助語文教學這一項工作
中,我們建構了一個真實的系統,可以輔助國語科教師編輯試題。中文訴訟文書分類並不
是這一次研究案的主角,是我們結束前一國科會研究案的工作。本次研究計畫執行期間,
合計發表 17 篇論文(兩篇國際期刊論文、三篇國際學術研討會論文國內學術會議方面,則
有六篇 ROCLING 論文、四篇 TAAI 論文、一篇 NCS 和一篇 TANET 論文)
,總頁數達到 133
頁;其中包含一篇人工智慧與電腦輔助教學跨領域研究的優質期刊論文(IJAIED)和一篇計
算語言學優質研討會(ACL)的研討會論文。
關鍵詞:貝氏網路、學生學習歷程、建模技術、資訊檢索、電腦輔助語文學習、機器翻譯
Abstract
In this report, we summarize the results of this research project on several fronts.
For student modeling, we proposed a simulation-based approach to learn the
struc-tures of Bayesian networks that contain unobservable variables. We have built three
functioning systems for practical applications of natural language processing
tech-niques. We built an environment for computer-assisted translation of TIMSS test
items, an environment for assisting teachers to compose test items for elementary
Chinese, and an environment for searching Chinese indictment documents.
Keywords: Bayesian networks, structure learning, learning processes of composite
concepts, information retrieval, computer assisted language learning, machine
translation
II
目錄
報告內容 ……… 1
前言 ……… 1
研究目的 ……… 1
相關文獻 ……… 2
研究方法 ……… 2
研究成果 ……… 3
論文列表 ……… 4
計畫成果自評 ……… 6
附錄 ……… 7-22
1
報告內容
前言
本研究案雖然不是一個整合型計畫,但是一開始即訂定多項目標,因此實際上很難以
一份報告來總結所有子項計畫的研究成果。因此,在這一份報告中,我們為個別子項計畫
撰寫簡要的文字資料,然後請有興趣深入研究的讀者繼續研讀已經發表的期刊論文或者學
術會議論文。
這一個研究案從事兩大類但相互關連的研究工作:一個是認知歷程的建模技術,另一
個則是以自然語言處理為基礎的實際軟體系統的建置。認知歷程的建模技術方面,我們以
學生學習綜合觀念的問題作為研究的主題。實際軟體系統方面,我們建立了三個不同的系
統:中文訴訟文書檢索系統、電腦輔助 TIMSS 試題翻譯環境和電腦輔助國語科試題出題環
境。
在實際的環境中,如果我們想要利用人工智慧技術來讓軟體系統提供使用者最好的服
務,瞭解使用者真實的需求是必要的基礎。實務上我們很難經常性地詢問使用者的需求和
回饋,因此從間接的資訊來推測使用者的興趣或者意向是重要的基礎技術。所以,以上兩
大類的研究工作,以長遠的角度來說是有密切關連的。現階段的工作是一個逐漸打底的工
作;我們期待繼續朝綜合人工智慧技術、機器學習技術和自然語言處理技術來建構有用的
資訊檢索環境和電腦輔助語文學習的環境。
在研究進度方面,計畫主持人全力從事認知歷程的建模技術,因此這一部分的成果比
較能夠掌握。應用自然語言處理技術來建立實際系統的部分,則全部是以碩士班研究生執
行,雖然能夠維持一些進度,但是計畫推展的速度並不能令人完全滿意。
在研究成果方面,我們發表了 17 篇學術論文,總頁數達到 133 頁。在研究成果小節中
我們將分析所達成的成果。我們把各項主要子項工作比較具有代表性的論文附在本份報告
的附錄中。就如前面所說明,這一份報告的本身其實只能是我們所進行的所有工作的大摘
要而已,所有工作的真正成果已經反映在所發表的論文之中,因此雖然我們必須把論文放
在附錄,但是其實論文本身才應該是這一個研究案的成果的真正主角。
附錄包含了四篇論文:IJAIED 的期刊論文一篇(建模技術相關論文,
這是一篇出版商
有版權的文章,不宜在網路上公開
),ACL 國際學術研討會論文一篇(電腦輔助國語科試
題出題輔助系統)
, ROCLING 國內學術研討會論文一篇(電腦輔助 TIMSS 試題翻譯環境)
和 TAAI 國內學術研討會論文一篇(中文訴訟文書檢索系統)。
研究目的
我們分四個段落簡述四個不同的子項目的研究目的。詳細資料請參閱相關論文。
在建立使用者模型方面,我們希望能夠找到一個好的辦法,讓我們可以在不能夠直接
觀測模型中所有相關變數的狀態的情形之下,仍然能夠以貝氏網路來表示所有相關變數的
直接和間接機率關係。在所進行的研究中,學生的答題的反應(目前僅以「對」和「錯」
表示)是可以直接觀測的變數,而我們所建立的模型包含了學生對於個別觀念的能力。能
力與答題的對錯雖然有密切關係,但是關係卻不是邏輯式的,因為有人會因為運氣好答對
2
題目,也有人會因為一時疏忽等複雜原因,在有相關能力的情形之下,卻沒有答對題目。
簡單地說,本項研究是要以學生的答題的對錯來反推學生的學習模式的貝氏網路。
在中文訴訟文書檢索系統中,我們採用了幾種資訊檢索和人工智慧的分類、分群的技
術來輔助專業和非專業法學人士來檢索以中文撰寫的地方法院訴訟文書。對於檢索者而
言,我們希望能夠提高相關判例的檢索效率,同時這一系統也希望能夠有助於專業人士檢
索相關刑事案件的判刑刑度,藉此希望有助於法院判決的一致性。
電腦輔助 TIMSS 試題翻譯環境的研究,同樣也是結合人工智慧與自然語言處理的應用
研究,目的是協助 TIMSS 試題的翻譯。TIMSS 試題的原文是以英文撰寫的國際標準試題,
測驗的目的是要評比參與 TIMSS 計畫的各個國家的科學數理的教學成效。我國參與 TIMSS
計畫,因此須要把 TIMSS 試題翻譯為中文試題,好讓我國四年級和八年級(國中二年級)
的學生受測。我們建構了一個環境,希望能協助負責翻譯試題的專家,能夠以較低的時間
代價從事符合翻譯準則的翻譯工作。
電腦輔助國語科試題出題環境則是利用自然語言處理技術,協助國語科或者華語教師
編輯與華語學習相關的試題,好讓教師能夠透過網路從事測驗。這一個系統同時包含了試
題編輯、題庫管理、網路施測和測後分析等功能。試題的類型則包含的漢語語音辨識、改
錯字試題、中文克漏詞(cloze)、中文量詞和句子重組五個題型。
文獻探討
由於前述的四大項研究各有自己相關的文獻,因此無法在一篇報告中簡單地整合。除
了因為研究方向的重要差別,另外也因為相關文獻的量的關係,請有興趣的讀者與評審參
閱個別論文中的相關文獻探討的資料。
研究方法
我們分四個段落簡述四個不同的子項目的研究方法。詳細資料請參閱相關論文。
在建立使用者模型方面,我們首先建立一般適性化教學研究所依賴的模型,利用這樣
的模型來產生模擬的學生答題表現。有了答題表現的資料,我們才能進行下一步研究。在
研究中,我們比較了以經驗法則(heuristics)、類神經網路(artificial neural networks)和支持向
量機(support vector machines)所建構的分類器等技術來猜測先前用以產生模擬的學生資料
時所使用的貝氏網路模型。除了利用經驗法則來猜測的方法之外,我們須要利用監督式學
習法(supervised learning)來訓練類神經網路模型和支持向量機模型,這時我們假設有專業的
猜測,讓我們得以限縮所欲尋找的模型的範圍。實驗中,我們假設了學生的答題反應跟其
真實能力,只會呈現機率式的關連性,同時操弄這一關連性的不確定性,來研究經驗法則、
類神經網路和支持向量機所建構的分類器,在不同的程度的不確定性關連下所能達成的正
確性。
在中文訴訟文書檢索系統中,除了典型的 inverted indexing 之外,我們利用更多的自然
語言處理技術,建構不同的管道來協助查詢者找到有用的資料。這其中跟語意比較相關的
是我們採用了詞組(term pairs)為基礎的分群機制,讓我們來評比訴訟文書的相關度直覺上來
說。以詞組為檢索機制,比較能夠彰顯詞彙的語意。此外,我們也利用詞彙的同現(collocation)
3
來導引建議檢索檔案。跟我們以詞組為基礎來做檔案分群的理念相似,以同現的分數高低
來建議檢索資料,也可能因為比較能夠捕捉到檢索者的意圖而提高檢索效率。
電腦輔助 TIMSS 試題翻譯環境的建置是一個典型的機器翻譯(machine translation)的研
究。對於機器翻譯這個研究議題來說,兩年的計畫時程只能建立基礎而已。我們應用語言
模型(language models)、雙語對譯資料(parallel corpora)、範例式學習技術(example-based
learning)三個主要技術,結合現在受到學界普遍使用的 Moses 和 Lucene 開放式軟體工具建
立了一個翻譯輔助環境。本研究案,受到國立台灣師範大學科學教育中心的張主任的協助,
因此得以獲得相關的 TIMSS 中英文試題。
電腦輔助國語科試題出題環境提供五大類型試題的編輯:漢語語音辨識、改錯字試題、
中文克漏詞、中文量詞和句子重組。因此我們須要利用到語音、漢字構形、漢語詞彙和漢
語語法等數個不同層次的語文資訊。我們利用自然語言處理技術,依照試題編輯者(通常
是教師)所要求的試題條件,從所蒐集的語文資料找到相關的語料,並且依照所編輯的試
題的特性提出有用的建言。試題編輯者可以利用我們的介面建立基本的題庫,進而建立試
卷資料庫,爾後學生也可以透過網路作答。學生作答的結果可以立即得到回饋,教師也可
以分析所任課的學生群的測驗結果,檢討其教學策略。
研究成果與討論
我們分別簡述四個不同的子項目的研究成果。詳細資料(特別是個別研究的學術意義)
請參閱相關論文中比較詳細的討論。
在 建 立 使 用 者 模 型 方 面 , 我 們 在 International
Journal
of
Artificial
Intelligence
in
Education (IJAIED) 發表了一篇 49 頁的長篇論文[1],在這之前,我們在全國計算機會議發
表了一篇中文論文[12]為國內學者介紹這一個研究的縮影 。IJAIED 是一個優質的期刊,
是 International AIED Society 的正式期刊,由 University of Edinburgh 的教授擔任主編,一
年一般只收錄十餘篇論文,其中部分還是兩年一次的 AIED 學術研討會的最佳論文才能獲
得推薦。因此研究成果能夠在 IJAIED 刊登,應該算是相當不容易的一項成就。
在中文訴訟文書檢索系統方面,我們在 2007 年和 2008 年的人工智慧學會年會(TAAI)
發表了三篇論文[6, 13, 14]。
電腦輔助 TIMSS 試題翻譯環境的建置方面,我們在 Journal of Advanced Computational
Intelligence and Intelligent Informatics (JACIII)發表了一篇簡短的期刊論文[2],在 RANLP 國
際學術研討會中發表了一篇論文[5], 在 2007 年和 2008 年的計算語言學研討會(ROCLING)
上各發表了一篇論文[10, 17]。
因為所牽涉的問題,不僅僅是資訊科學的技術,同時還有關於教學的可能成效,因此
電腦輔助國語科試題出題環境的研究成果,部分是發表在比較接近教育領域的會議中,去
接受第一線的使用者的挑戰。這一方面的部分成果發表於 JACIII 期刊論文[2],2008 年的
ACL 國際學術研討會[3],2008 年的 CAERDA 學術研討會[4],2007 年的 RANLP 國際學術
研討會[5],兩篇 2008 年的計算語言學研討會(ROCLING)[8, 9]和一篇 2007 年的網際網路研
討會(TANET)[15]。ACL 是國際間計算語言學界最著名的國際學術研討會之一,研究成果
能夠獲得 ACL 年會收錄,是一項不錯的成就。
4
除了本份報告目前所報告的四項研究子項目之外,我們這一個研究計畫還做了一些嘗
試性質的研究,這一些嘗試性的研究偶而也有一些零星的論文發表。在研究生方面,這兩
年期間,有一為研究生曾經探討利用文件分類的技術來猜測新聞報導與股價漲跌趨勢的可
能關係[16],另有一位研究生探討利用文件內容的分析技術,來為研討會投稿論文找尋合適
的論文評審委員[11],這兩項研究經驗都發表在 ROCLING 研討會。此外,我們也有一位大
學部同學利用機器學習技術的觀念,發展出一個可以提供任意形狀棋盤的黑白棋(Reversi)
服務的軟體服務[7],這一向研究成果則發表於 TAAI 研討會。
論文列表
以下是因本項研究案所得以發表的學術論文清單
1.
Chao-Lin Liu. A simulation-based experience in learning structures of Bayesian networks to
represent how students learn composite concepts, International Journal of Artificial
Intelli-gence in Education, 18(3), 237‒285. IOS Press, The Netherlands, September 2008.
2.
Ming-Shin Lu(呂明欣), Yu-Chun Wang(王昱鈞), Jen-Hsiang Lin(林仁祥), Chao-Lin Liu,
Zhao-Ming Gao(高照明), and Chun-Yen Chang(張俊彥). Supporting the translation and
authoring of test items with techniques of natural language processing, Journal of Advanced
Computational Intelligence and Intelligent Informatics, 12(3), 234‒242. Fuji Technology
Press, Japan, May 2008.
3.
Chao-Lin Liu and Jen-Hsiang Lin(林仁祥). Using structural information for identifying
similar Chinese characters, Proceedings of the Forty Sixth Annual Meeting of the
Associa-tion for ComputaAssocia-tional Linguistics: Human Language Technologies (ACL’08), short paper,
93‒96. Columbus, Ohio, USA, 2008.
4.
Chao-Lin Liu, Jen-Hsiang Lin(林仁祥), and Chih-Bin Huang(黃志斌). A platform for
au-thoring test items for elementary Chinese with techniques of natural language processing,
presented in the 2008 CAERDA International Conference (Chinese American Educational
Research and Development Association). New York, New York, USA, 2008.
5.
Ming-Shin Lu(呂明欣), Jen-Hsiang Lin(林仁祥), Yu-Chun Wang(王昱鈞), Zhao-Ming
Gao(高照明), Chao-Lin Liu, and Chun-Yen Chang(張俊彥). Prototypes of using NLP
tech-niques for assisting translation and authoring of test items, Proceedings of the Workshop on
NLP for Educational Resources, International Conference on Recent Advances in Natural
Language Processing 2007 (RANLP’07), 1‒6. Borovets, Bulgaria, 2007.
6.
藍家良、賴敏華、田侃文及劉昭麟。訴訟文書檢索系統,
第十三屆人工智慧與應用研
討會論文集
(TAAI’08),305‒312。2008 年。
7.
林正宏及劉昭麟。任意棋盤的 Othello 遊戲,
第十三屆人工智慧與應用研討會論文集
(TAAI’08),443‒449。2008 年。
8.
劉昭麟、黃志斌、翁睿妤及莊怡軒。形音相近的易混淆漢字的搜尋與應用,
第二十屆
自然語言與語音處理研討會論文集
(ROCLING XX),108‒122。2008 年。
9.
賴敏華及劉昭麟。電腦輔助中學程度漢英翻譯習作環境之建置,
第二十屆自然語言與
語音處理研討會論文集
(ROCLING XX),293‒307。2008 年。
5
10. 張智傑及劉昭麟。以範例為基礎之英漢 TIMSS 試題輔助翻譯,
第二十屆自然語言與語
音處理研討會論文集
(ROCLING XX),308‒322。2008 年。
11. 陳禹勳及劉昭麟。電腦輔助推薦學術會議論文評審委員之初探,
第二十屆自然語言與
語音處理研討會論文集
(ROCLING XX),323‒337。2008 年。
12. 劉昭麟。利用試題反應建立學生學習歷程模型的一些經驗,
中華民國九十六年全國計
算機會議論文集
(NCS’07),第一冊(下),359‒366。2007 年。
13. 何君豪、鄭人豪及劉昭麟。階層式分群法在民事裁判要旨分群上之應用,
第十二屆人
工智慧與應用研討會論文集
(TAAI’07),794‒800。2007 年。
14. 鄭人豪及劉昭麟。中文詞彙集的來源與權重對中文裁判書分類成效的影響,
第十二屆
人工智慧與應用研討會論文集
(TAAI’07),801‒808。2007 年。
15. 林仁祥及劉昭麟。國小國語科測驗卷出題輔助系統,2007
臺灣網際網路研討會論文集
(TANET’07),論文光碟。2007 年。
16. 陳俊達、王台平及劉昭麟。以文件分類技術預測股價趨勢,
第十九屆自然語言與語音
處理研討會論文集
(ROCLING XIX),347‒361。2007 年。
17. 呂明欣、高照明、劉昭麟及張俊彥。針對數學與科學教育領域之電腦輔助英中試題翻
譯系統,
第十九屆自然語言與語音處理研討會論文集
(ROCLING XIX),407‒421。2007
年。
6
計畫成果自評
這一項研究計畫歷時兩年,原本的研究目標包含兩大方向,一個是模型建立技術的研
究,另一個則是與自然語言處理相關的研究。在這兩年之中,我們合計發表兩篇期刊論文,
三篇國際學術研討會論文和 12 篇國內學術會議論文。
在建立模型技術的研究方面,我們覺得有很值得自豪的成就,能夠在 International
Journal of Artificial Intelligence in Education (IJAIED) 發表長篇論文。IJAIED 是 AIED 學會
的代表期刊,而 AIED 的學術研討會和 ITS 學術研討會則是電腦輔助教學兩大旗艦級的國
際學術研討會。部分的 IJAIED 論文還是從 AIED 兩年一次的國際學術會議中精選而得的
(ITS 也是兩年一次的國際學術會議)。因此,我們主觀地相信以兩年多的努力來換取一篇
IJAIED 的論文是一項值得的投資。
相對之下,自然語言處理相關的研究的學術成果則顯得較為薄弱,由於研究計畫的規
模和過去兩年的兼任研究助理都還是只有由碩士班研究生來擔任,因此只能建立一些基礎
的經驗,僅僅在發表論文的數量和研究廣度上做努力。我們在電腦輔助法學資訊檢索,電
腦輔助機器翻譯和電腦輔助國語科試題編輯三個方面,都建置了真實可以在網路上使用的
軟體,除了為實驗室建立一些可用的軟體工具,為更深層研究建立基礎之外,最明顯可見
的成果可能是在於訓練可以進入職場的資訊科技人才。
7
附錄
本附錄依序包含下列四篇論文。
1.
Chao-Lin Liu. A simulation-based experience in learning structures of Bayesian networks to
represent how students learn composite concepts, International Journal of Artificial
Intelli-gence in Education, 18(3), 237‒285. IOS Press, The Netherlands, September 2008.
2.
Chao-Lin Liu and Jen-Hsiang Lin(林仁祥). Using structural information for identifying
similar Chinese characters, Proceedings of the Forty Sixth Annual Meeting of the
Associa-tion for ComputaAssocia-tional Linguistics: Human Language Technologies (ACL’08), short paper,
93‒96. Columbus, Ohio, USA, 2008.
3.
張智傑及劉昭麟。以範例為基礎之英漢 TIMSS 試題輔助翻譯,
第二十屆自然語言與語
音處理研討會論文集
(ROCLING XX),308‒322。2008 年。
4.
藍家良、賴敏華、田侃文及劉昭麟。訴訟文書檢索系統,
第十三屆人工智慧與應用研
A Simulation-Based Experience in Learning Structures of
Bayesian Networks to Represent How Students Learn
Composite Concepts
Chao-Lin Liu, Department of Computer Science, National Chengchi University, Taiwan [email protected]
Abstract. Composite concepts result from the integration of multiple basic concepts by students to
form high-level knowledge, so information about how students learn composite concepts can be used by instructors to facilitate students’ learning, and the ways in which computational techniques can as-sist the study of the integration process are therefore intriguing for learning, cognition, and computer scientists. We provide an exploration of this problem using heuristic methods, search methods, and machine-learning techniques, while employing Bayesian networks as the language for representing the student models. Given experts’ expectation about students and simulated students’ responses to test items that were designed for the concepts, we try to find the Bayesian-network structure that best represents how students learn the composite concept of interest. The experiments were conducted with only simulated students. The accuracy achieved by the proposed classification methods spread over a wide range, depending on the quality of collected input evidence. We discuss the experimental proce-dures, compare the experimental results observed in certain experiments, provide two ways to analyse the influences of Q-matrices on the experimental results, and we hope that this simulation-based ex-perience may contribute to the endeavours in mapping the human learning process.
Keywords. Student Modelling; Learning Patterns; Bayesian Networks; Computer-Assisted Cognitive Modelling;
Computer-Assisted Learning; Machine Learning
INTRODUCTION
Obtaining good student models is crucial to the success of computer-assisted learning. Relying on stu-dent models, computerised adaptive testing systems (CATs) may assess stustu-dents’ competence levels more efficiently than traditional pen-and-paper tests by adaptively selecting and administering appro-priate test items for individual students (van der Linden & Glas, 2000). If, in addition, a model cap-tures how students learn, then we may apply the model for computer assisted instruction and testing (Nichols et al., 1995; Leighton & Gierl, 2007). For instance, by introducing prerequisite relationships in a refined model, Carmona et al. (2005) showed that there is room for boosting the efficiency of CATs. In this paper, we adopt Bayesian networks (Pearl, 1988; Jensen & Nielsen, 2007) as the lan-guage to represent student models, and discuss a simulation-based experience in which we attempted to learn student models with machine-learning techniques based on students’ responses to test items. The simulation-based results indicate how and when we can learn students’ learning patterns from
their item responses, and shed light on some difficulties that we may encounter in similar studies that use the item responses of real students.
Measuring students’ competence levels with their responses to test items is a typical problem of uncertain reasoning in CATs. The slip and guess cases are two frequently mentioned sources of uncer-tainty, e.g., (VanLehn et al., 1994; Millán & Pérez-de-la-Cruz, 2002). Students may accidentally fail to respond to test items correctly (the slip case), or they may just be lucky enough to guess the correct answers to the test items (the guess case). Students may also make mistakes intentionally (Reye, 2004). Due to such an uncertain correspondence between students’ mastery levels and item responses, re-searchers and practitioners have applied probability-based methods for student assessment (Mislevy & Gitomer, 1996). Vos (2000) and Vomlel (2004), for instance, showed that probability-based proce-dures offer chances for teachers to correctly identify students’ mastery levels with a fewer total num-ber of test items in tests of variable length.
In recent years, Bayesian networks have offered a convenient computational tool for implement-ing the probability-based testimplement-ing procedures and also for cognitive and developmental psychology (Glymour, 2003). Martin and VanLehn (1995) and Mislevy and Gitomer (1996) studied the applica-tions of Bayesian networks for student assessment. Mayo and Mitrovic (2001) conducted a survey of this trend and applied decision theories to optimise their systems for intelligent tutoring. Conati et al. (2002) applied Bayesian networks to both assessing students’ competence and recognising students’ intention. The research on applications of Bayesian networks in CATs also led to real world perform-ing systems, e.g., SIETTE (Conejo et al., 2004; Guzmán et al., 2007b).
To apply Bayesian networks in an inference task, we need the network structure and the condi-tional probability tables (CPTs) that implicitly specify the joint probability distribution of all of the variables of interest. Just as we have to learn model parameters when we apply the Item Response Theory (van der Linden & Hambleton, 1997) in CATs, we have to learn the CPTs for Bayesian net-works (Mislevy et al., 1999) from students’ records, while experts often provide specifications of the network structures. The network structure essentially portrays the structure of the knowledge of the students in the study, and has an influence on the ways in which the decision mechanisms in CATs make inferences about students’ mastery levels.
Not surprisingly, researchers have explored different network structures in which the nodes for the variables were organised in different styles. For instance, Millán and Pérez-de-la-Cruz (2002) categorised nodes in their multi-layer Bayesian networks into four types: subjects, topics, concepts, and questions. Reye (2004) employed nodes that represented students’ competence as the backbone of the network, and associated a uniform substructure with each node on the backbone to assist the proc-ess of making inferences about students’ competence. Despite the differences in the network structures, both studies emphasised the importance of modelling the prerequisite relationships among the learning targets. Carmona et al. (2005) reported that adding prerequisite relationships in Bayesian networks helped reduce test lengths in CATs. In addition to utilising different categories of variables, research-ers may choose to let the nodes for concepts be parent nodes of nodes for test items, or the other way around. Mislevy and Gitomer (1996) and Millán and Pérez-de-la-Cruz (2002) discussed the implica-tions of the different choices which can be made in the direcimplica-tions of the links.
Although the majority of the CAT research community rely on experts to provide network struc-tures, it is conceivable that we may learn the network structures from students’ records using the ma-chine learning techniques for Bayesian networks (Heckerman, 1999; Jordan, 1999; Neapolitan, 2004). Vomlel (2004) attempted to apply a variant of the PC-algorithm (Spirtes et al., 2000) that was imple-mented in Hugin (http://www.hugin.dk) to learn network structures, and augimple-mented the networks with
hidden variables based on experts’ knowledge. Recently, Desmarais et al. (2006) learned item-to-item knowledge structures from students’ records, and compared the learned structures with those reported in (Vomlel, 2004). The item-to-item knowledge structures are special in that the states of all of the nodes in the networks are directly observable, making the learning of the network structures a rela-tively practical matter. The experience indicates that it is an interesting but challenging task to learn the network structures from scratch in the cases that there are many hidden variables, due in part to the large number of candidate network structures.
We approach the structure learning problem from a different perspective. Instead of trying to learn student models from scratch, we propose methods for helping experts select models that differ in subtle ways. This can be helpful for constructing student models for how students learn composite concepts. Assume that it requires knowledge of four basic concepts, say cA, cB, cC, and cD, to learn a composite concept dABCD. In this case, will we be able to tell whether students manage to learn dABCD by directly integrating cA, cB, cC, and cD or whether they first integrate cA, cB, and cC into an intermediate product and then integrate this intermediate product with cD? To what extent can the use of machine learning techniques help us to identify the direct prerequisites necessary for the pro-duction of the composite concept?
We explore methods to answer this question by expressing the problem with Bayesian networks and by learning the network structures based on students’ responses to test items. Although there are various methods for learning Bayesian networks (Heckerman, 1999; Neapolitan, 2004), our learning problem is distinct. We face a problem of learning the structure of hidden variables because we cannot directly observe students’ competence levels of the concepts. The students’ item response patterns that we can observe and collect have only an indirect and uncertain relationship with students’ actual com-petence patterns, which is a challenge that has long been discussed in the literature on CATs, e.g., (Martin & VanLehn, 1995; Mislevy & Gitomer, 1996). Although the states of the hidden nodes for the competence levels can only be inferred indirectly, we are sure of the existence of the hidden nodes, so our focus is to learn the structure that relates the hidden variables. Finally, for any practical problems that involve three or more basic concepts, there are at least four hidden variables in question, making the target problem nontrivial.
In order to explore the effectiveness of different computational techniques for the target problems, we employ the device of simulated students which has been used in many studies on methodologies for intelligent tutoring systems, e.g., (VanLehn et al., 1994; Vos, 2000; Mayo & Mitrovic, 2001; Millán & Pérez-de-la-Cruz, 2002; Liu, 2005; Desmarais et al., 2006; Matsuda et al., 2007). We gener-ated the item responses of the students that were simulgener-ated with a specific Bayesian network whose structure encoded beliefs about how students learned composite concepts. We could control the degree of uncertainty in the relationship between the item responses and the mastery levels by adjusting the simulation parameters. Hiding the original Bayesian network, we applied mutual information (MI) (Cover & Thomas, 2006), search-based methods, artificial neural networks (ANNs) (Bishop, 1995), and support vector machines (SVMs) (Cortes & Vapnik, 1995) to analyse students’ item responses to determine the structure of the original network.
We report experimental results and discuss observations that are potentially useful for further studies. The quality of the predictions that are made by our classifiers depends on many factors, e.g., the algorithms that we used to guess the network structures, the degree of uncertainty in the relation-ships between the students’ competence levels and the item responses, and the quality of the training data for the machine-learning algorithms. On average, using SVMs as the underlying classification mechanism offers the best performance and efficiency, when training data of good quality is available.
Experimental experience provides hints on the principles that are useful for guiding the designs of fur-ther studies. More specifically, we identify some methods for determining the quality of training data, provide two analytical methods for comparing the influences of Q-matrices on the experimental results, and report situations when different classification methods may offer better performance. Specific de-tails will be discussed in appropriate sections.
We define the target problems and provide background information in Preliminaries†, discuss the applications of mutual information, search-based methods, artificial neural networks, and support vec-tor machines to the problems in Methods for Model Selection, and present the design of experiments in Design of the Experiments. In Idealistic Evaluations, we evaluate and compare the effects of the proposed methods under different combinations of slip, guess, and Q-matrices, when the quality of training data is good. In More Realistic Evaluations, we investigate the results of experiments under different combinations of slip, guess, and Q-matrices, when the quality of training data is relatively poor. Finally, we summarise the implications of the simulation results and review more relevant litera-ture in Summary and Discussion.
PRELIMINARIES
We outline the nature of the problems that we would like to solve in the first subsection, and explain how we formulate the target problems with Bayesian networks in the second subsection. Using Bayes-ian networks as the representation language, we provide a more precise definition of the target prob-lem in the third subsection, show how we simulate students’ item responses in the fourth subsection, look into the issue about computational complexity in the fifth subsection, and illustrate the difficulty of solving the target problems with existing software in the last subsection.
The Simulated World
We consider a set of concepts C)and an item bank ℑ that contains test items for C). Some concepts in C) are basic and others are composite. Learning a composite concept requires the students to integrate their knowledge about certain basic concepts. A composite concept, say dABC, is the result of integrat-ing knowledge about basic concepts cA, cB, and cC. Let C) contain n concepts, i.e.,
} , , , {C1C2 Cn
C)= L . For each concept Cj C )
∈ , we have a subset ℑj={Ij,1,Ij,2,L,Ij,mj} in ℑ for test-ing students’ competence in Cj. For easier reference, we call Cj the parent concept of the items in
j
ℑ . The concepts that students directly integrate to form a composite concept Ck are also referred as the parent concepts of Ck. Based on this definition, a prerequisite concept is not necessarily a parent concept of a composite concept. More specifically, cA and cB are not parent concepts of dABC when, for instance, students learn dABC by integrating dAB and cC, although cA and cB must be prerequi-sites of dABC. We refer to a student’s competence in the concepts being studied as a competence
pat-tern, and assume that students demonstrate special patterns in their competence. Students that share
the same competence patterns form a subgroup.
We employ the convention of the Q-matrix, originally proposed to represent the relationships be-tween concepts and test items (Tatsuoka, 1983), for the encoding of the competence of a subgroup in
† We use the font of Helvetica for section headings to avoid the need to use numbered section headings.
the basic concepts and also in being able to integrate the parent concepts into composite ones. In Table 1, there are two Q-matrices that are separated by the double bars, and the “SID” column shows the identification of the subgroups. We will use these Q-matrices in the experiments reported in Idealistic Evaluations and More Realistic Evaluations. Let qj,k denote the cell at the jth row and the kth col-umn in a Q-matrix. If Ck is a basic concept, we set qj,k to 1 when students of the j
th subgroup has the competence in Ck; if Ck is a composite concept, we set qj,k to 1 when students of the jth subgroup has the ability to integrate all of the parent concepts of Ck. Hence, if the kth concept is composite, the jth subgroup is competent in the concept only if 1
,k= j
q and the jth subgroup is competent in all of the parent concepts of the kth concept. Based on this definition,
k j
q, is related to both the rule nodes and the rule application nodes that are defined by Martin and VanLehn (1995).
The competence patterns, which are used in our simulations, are not as deterministic as they ap-pear. In the simulations, we intentionally introduce some degrees of uncertainty to reflect the possibil-ity that teachers may not categorise the subgroups precisely. This is similar to the concept of residual ability discussed in (DiBello et al., 1995, page 362). We will go further into this issue when we present our simulator in Generating Student Records.
As discussed in (DiBello et al., 1995, pages 365 and 370), we can apply Q-matrices in different ways, depending on the interpretation of the rows and columns. In addition, the contents of the matri-ces can differ in a wide variety of ways, and, consequently, researchers can report results of experi-ments using a selected number of Q-matrices typically. Different choices of the Q-matrices certainly influence the results of our experiments, and we will discuss this issue shortly.
Example 1. In the Q-matrices shown in Table 1, we assume that students form only eight subgroups,
although there could be 27 subgroups in a problem that includes seven concepts. The competence pat-tern for the subgroup g8 in the left Q-matrix is {1, 1, 1, 0, 0, 0, 1}. By adopting the left Q-matrix, we assume that a typical student in g8 should be competent in all basic concepts, should be able to inte-grate the parent concepts for dABC, but cannot inteinte-grate the parent concepts for dAB, dBC, and dAC at the time of the experiments. ■
A Formulation with Bayesian Networks
We choose to use Bayesian networks to represent student models, because Bayesian networks are a popular choice for researchers to capture the uncertain relationship between students’ performance and
Table 1. Competence patterns in two Q-matrices Competence in (integrating) concepts
SID cA cB cC dAB dBC dAC dABC cA cBcC dAB dBC dACdABC
g1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 g 2 1 1 1 0 1 1 1 1 1 1 1 1 0 1 g 3 1 1 1 1 1 0 1 1 1 1 1 0 1 1 g 4 1 1 1 1 0 1 1 1 1 1 1 0 0 1 g 5 0 1 1 0 1 0 1 1 1 1 0 1 1 1 g 6 1 1 0 1 0 0 1 1 1 1 0 1 0 1 g 7 1 0 1 0 0 1 1 1 1 1 0 0 1 1
g 8 1 1 1 0 0 0 1 1 1 1 0 0 0 1 their competence in many research projects, e.g., (VanLehn et al., 1998; Mislevy et al., 1999, Millán & Pérez-de-la-Cruz, 2002; Reye, 2004; Vomlel, 2004; Carmona et al., 2005; Chang et al., 2006; Al-mond, 2008). We employ nodes in Bayesian networks to represent students’ competence in concepts and the correctness of their responses to test items. For easier recognition, we use the names of the concepts as the names of the nodes that represent the concepts. The names of the nodes that represent the correctness of the item responses are in the form of iXα, where i denotes item, X is the name of the parent concept, and α is the identification number of the test item. When there is no risk of confusion, we refer to the nodes that represent concepts simply as the concepts and the nodes that represent test items simply as test items. Hence, in Figure 1, we have seven different conceptsthree basic ones (cA, cB, and cC) and four composite ones (dAB, dBC, dAC, and dABC). As a simplifying assumption, each simulated student will respond to three test items designed for every concept. For instance,
} 3 , 2 , 1 {iAiAiA cA=
ℑ , and iA1, iA2, and iA3 are test items for cA.
All nodes are dichotomous in our simulation, except for the group node. In all simulations, group will be used as a special node that represents the student subgroups, and it can have such values as g1, g2, …, and gγ, where γ depends on the design of the simulations. Nodes representing competence lev-els may have either competent or incompetent as their values, and nodes representing item responses may have either correct or incorrect as their values.
The links in a Bayesian network signify direct relationships between the connected nodes, and the nodes that are not directly connected are conditionally independent (Pearl, 1988; Jensen & Nielsen, 2007). There are no strict rules governing the directions of the links in Bayesian networks, except that a valid Bayesian network must not contain any directed cycles and that it is recommended that we fol-low the causal directions in model construction (Russell & Norvig, 2002). The literature has discussed the implications of different choices of the directions of the links for CATs, e.g., (Mislevy & Gitomer, 1996; Millán & Pérez-de-la-Cruz, 2002; Glymour, 2003; Liu, 2006d). We employ the most common choices, and discuss relevant issues in Impacts of Latent Variables and Summary and Discussion. As a result, links point from the parent concepts to the integrated concepts and from the parent con-cepts to their test items.
In Figure 1, the values of group come from the set of possible student subgroups. If we use either of the two Q-matrices in Figure 1, group will have eight possible values, each denoting a possible stu-dent group. Since the subgroup istu-dentity of a stustu-dent affects the competence pattern, there are direct links from group to all concept nodes.
We defer the discussion of how we set the contents of the conditional probability tables to Gen-erating Student Records.
group
cA cB cC
dAB dBC dAC dABC
iA1 iA3 iA2 iC2 iC3 iC1 iB2 iB3 iB1
iAB1iAB2iAB3
iABC1iABC2 iABC3 iAC3 iAC2 iAC1 iBC3 iBC2 iBC1
The Target Question and Assumptions
Our target problem is to learn how students learn composite concepts by observing students’ fuzzy (Bi-renbaum et al., 1994) item-response patterns that have only an indirect relationship with their compe-tence patterns. Students’ item responses are fuzzy because they do not necessarily indicate students’ actual competence.
A composite concept is a concept that requires the knowledge of two or more basic concepts. For instance, Mislevy and Gitomer (1996) used “Mechanical Knowledge”, “Hydraulics Knowledge”, “Canopy Knowledge”, and “Serial Elimination” as the prerequisites for “Canopy Scenario Requisites--No Split Possible”, and Vomlel (2004) included “Subtraction”, “Cancelling Out”, and “Multiplication” as the basic capabilities that are necessary for finding the solution for (43×65)−81.
Although it is convenient to use the nodes for all the prerequisites as the parent nodes of the node for the composite concept, we anticipate that constructing a more precise model that reflects the proc-ess of the learning of the composite concept may improve the performance of CATs and other com-puter-assisted learning tasks. This anticipation is related to the study of cognitive diagnostic assess-ment (Nichols et al., 1995; Leighton & Gierl, 2007). Indeed, Carmona et al. (2005) report that intro-ducing prerequisite relationships into their multi-layered Bayesian student models enables their CAT system to diagnose students with a fewer number of test items. Furthermore, if teachers know how students normally learn a composite concept, the teachers will have more information as to how to provide appropriate and specific help for students who fail to demonstrate competency in the concept (Naveh-Benjamin et al., 1995). For instance, if students normally learn dABC by integrating cA and dBC and if a student shows a lack of competence in dABC, a teacher may have to consider the stu-dent’s ability in learning dBC from cB and cC in addition to providing the student with information about the three basic concepts. Using Vomlel’s arithmetic problem as an example, we are wondering how computational techniques can help us compare the merit of the (partial) Bayesian networks shown in Figure 2.
Therefore, we consider the problem of how the use of computational techniques can help us iden-tify students’ learning patterns. To facilitate the discussion about the ways in which a composite con-cept may be learned, we define the notation that we will use to represent how students learn a compos-ite concept. Let τ denote the composite concept which we would like to know how students learn. As-sume that there are α basic concepts included in τ. Based on our non-overlapping assumption that we present below, τ can have at most α parent concepts. If some of τ’s parent concepts are composite, τ will have less than α parent concepts. We denote a way of learning τ by a computational form of τ . A computational form of τ may have one or more parts, the parts are connected by underscores, and each part of the computational form represents a parent concept of τ.
Definition 1. Assume that learning κ requires the knowledge of α basic concepts. Let {π1,π2,…,πμ}
(3/4 * 5/6) - 1/8 Multiplication Cancelling out Subtraction (3/4 * 5/6) - 1/8 Multiplication Cancelling out Subtraction C & M
Fig. 2. Which model is better?
denote the set of parent concepts of κ, where μ≤α. The computational form of the way to learn κ is π1_π2_…_πμ. Each computational form of a composite concept represents a learning pattern for stu-dents to learn the composite concept.
Definition 2. (The non-overlapping assumption) We assume that any two parent concepts defined in
Definition 1 do not have common basic concepts.
The non-overlapping assumption presumes that students must learn composite concepts from non-overlapping components. Specifically, the parent concepts of the composite concepts do not in-clude common basic concepts. Hence, there are only four possible ways to learn dABC: (1) integrating cA, cB, and cC directly (denoted by A_B_C); (2) integrating dAB and cC (denoted by AB_C); (3) inte-grating dBC and cA (denoted by BC_A); and (4) inteinte-grating dAC and cB (denoted by AC_B). The structure shown in Figure 1 is A_B_C. Figure 3 shows three other ways to learn dABC, and, from the left to right, they are AB_C, BC_A, and AC_B (Nodes for test items are not included for readability of the networks in Figure 3 and other Bayesian networks that we will discuss later in this paper.)
The non-overlapping assumption simplifies the space of the possible answers. Without excluding the possibility of overlapping ingredient concepts, we would have to consider AB_BC, AB_AC, and BC_AC if we minimise the number of overlapping basic concepts. We would also have to consider cases like AB_BC_A and even AB_BC_AC_A if we do not minimise the number of overlapping basic concepts. It is certainly possible that a student can learn dABC with these alternative methods. How-ever, we leave these more challenging possibilities for future studies.
As we present more details about the designs of our experiments, it will become clear that the methods we propose do not require the application of the non-overlapping assumption. However, mak-ing use of the assumption simplifies the space of the possible solutions, while the proposed methods can still be applied without the assumptions.
We do not assume further limitations on the ways that students might integrate the candidate par-ent concepts. For instance, under some circumstances, one might believe that a studpar-ent cannot inte-grate cA and cB unless cA is already a part of another relevant concept, say cAC. In this case, one might learn dABC from dAC and cB but not from dAB and cC. We did not consider such special con-straints in our study.
Definition 3. (The common assumption) All students learn a composite concept with the same
learn-ing pattern.
The common assumption presumes that all students use the same strategy to learn a composite concept. The purpose of using this assumption is just to simplify the presentation of our discussion. It is understood that there is no clear support for this rather controversial assumption. However, the cur-rent goal of our methods is to select exactly one best candidate from the many possible ways of learn-ing the composite concept. It will become clear, as we present our methods in the rest of this paper, that we can easily modify our methods to select the top k candidate solutions for human experts to
group
cA cB cC
dAB dBC dAC dABC
group
cA cB cC
dAB dBC dAC dABC
group
cA cB cC
dAB dBC dAC dABC
Fig. 3. Three other ways to learn dABC (from left to right): AB_C, BC_A, AC_B
make the final judgment about how students may learn the composite concept. We simply have to pre-sent the k highest-scored candidate structures to the experts to relax the common assumption. There-fore, we hope this assumption is not as provocative as it might appear.
In summary, we would like to find ways to tell which of the candidate networks, e.g., those in Figure 3, was used to generate the simulated students records.
Generating Student Records
The contents of the conditional probability tables (CPTs) of the Bayesian networks were generated based on a Q-matrix (e.g., those contained in Table 1), a given network structure (e.g., those shown in Figures 1 and 3), and simulation parameters according to the methods described in (Liu, 2005). When generating the CPTs, we considered not only the chances of slip and guess but also the chances of stu-dents’ abnormal behaviours that deviated from the typical competence patterns of the subgroups to which they belonged. To capture the uncertainty of this latter type, we inherited the concepts of group guess and group slip discussed in (Liu, 2005), but set both group guess and group slip to
groupInflu-ence. More precisely, when qj,k=1, we assigned a high probability for the j
th subgroup being compe-tent in the kth concept (if
k
C is basic), and this probability is sampled uniformly from [1-groupInfluence, 1], where groupInfluence is a simulation parameter selected for individual experi-ments. Hence, even if qj,k=1, Pr(Ck=competent|group=gj) might not be equal to 1, and students of the jth subgroup might not be competent in the kth concept. Similarly, when 0
,k= j
q , we assigned a low probability for the jth subgroup being competent in the kth concept (if
k
C is basic), and this probability is sampled uniformly from [0, groupInfluence]. Hence, even if qj,k=0, students of the jth subgroup might be competent in the kth concept.
The conditional probabilities of correctly responding to test items given different competence levels were specified with a standard procedure that has been commonly employed in the literature, e.g., (Martin & VanLehn, 1995; Mayo & Mitrovic, 2001; Conati et al., 2002; Millán & Pérez-de-la-Cruz, 2002). Instead of using two simulation parameters for slip and guess, we set these two parame-ters to the same value and called it fuzziness. Hence the probabilities
) competent | correct
Pr(Ij,k= Cj= and Pr(Ij,k=correct|Cj=incompetent)were, respectively, sampled uniformly from [1- fuzziness, 1] and [0, fuzziness]. Notice, again, that the value of fuzziness functioned as the bounds of the actual values of guess and slip but not their values.
Similar to what has been reported in the literature, e.g., (DiBello et al., 1995; Mayo & Mitrovic, 2001; Conati et al., 2002), we employed the concept of noisy-and (Pearl, 1988) for setting the condi-tional probabilities for the composite concepts which have multiple parent nodes. Noisy-and nodes reflect a probabilistic version of the “AND” relationship in traditional logics. The degree of noise is controlled by the simulation parameter groupInfluence. Readers are referred to (Liu, 2005) for more details.
We controlled the percentages of the subgroups in the entire simulated student population by ma-nipulating the prior distribution over the node group. We could use any prior distribution for group in the simulator. In the reported experiments, the node group took the uniform distribution as its prior distribution. Hence, if we were simulating a population of 10000 students that consisted of eight sub-groups, each subgroup might have approximately 1250 students.
In summary, we created Bayesian networks with the procedure reported in (Liu, 2005), and we controlled the degree of uncertainty by two parameters, i.e., groupInfluence and fuzziness. Given the network structure and the CPTs, we had a functioning Bayesian network, and could apply this network to simulate item responses of different types of students. We employed a uniform random number generator in simulating students’ behaviours with a typical Monte Carlo simulation procedure. For instance, we randomly sampled a number, δ, from a uniform distribution [0, 1]. If the conditional probability of correctly responding to iA2 was 0.3 for a particular subgroup of students and if δ >0.3, we would assume that this student responded to iA2 incorrectly. Students of the same subgroup may have different item responses to the same item because we independently drew a random number for each test item and each simulated student.
Example 2. Table 2 shows the data for certain students that we generated with the Bayesian network
shown in Figure 1 and the left Q-matrix shown in Table 1, when setting groupInfluence and fuzziness to 0.05 and 0.10, respectively. Each row in Table 2 contains a record for a simulated student, e.g., the first simulated student correctly responds to all of the test items while the second simulated student fails iA3 and iABC3. Although we always simulate item responses for students of all of the subgroups, we cannot show all of the data here. Notice that, due to the degree of uncertainty which was simulated and which was controlled by groupInfluence and fuzziness, a student who should be competent in a concept might not respond correctly to a test item for that concept. For instance, the second student of g1 fails to respond correctly to iA3, although all the members of g1 are supposed to be competent in cA as indicated by the Q-matrix.■
Computational Complexity
Assume that there are β basic concepts in C) . The computational complexity of our target problem comes from both the number of different ways that students can learn the composite concept which, directly or indirectly, integrates all β basic concepts and the number of different Q-matrices.
Table 2. A sample of simulated students’ item responses for the Bayesian network shown in Figure 1 and the left Q-matrix in Table 1 (1 and 0 denoting correct and incorrect, respectively)
Test Items
group
iA1 iA2 iA3 iB1 iB2 … iAB1 iAB2 iAB3 iBC1 … iABC1 iABC2 iABC3
g1 1 1 1 1 1 … 1 1 1 1 … 1 1 1 g1 1 1 0 1 1 … 1 1 1 1 … 1 1 0 g1 1 1 1 1 1 … 1 0 1 1 … 1 1 1 … … g2 1 1 0 1 1 … 0 1 0 1 … 1 1 0 g2 1 1 1 1 0 … 1 0 0 1 … 1 1 1 … … g5 0 0 0 1 1 … 0 0 0 1 … 0 0 1 g5 0 1 0 1 1 … 1 0 0 1 … 0 1 0 … … g8 1 1 1 0 1 … 0 0 0 0 … 1 1 1 g8 1 1 1 1 1 … 0 1 0 0 … 0 1 1 … …
Given the non-overlapping assumption and the common assumption, the number of different ways that students can learn the composite concept which integrates all β basic concepts is related to the Stirling number of the second kind (Knuth, 1973). Formula (1) shows the number of ways to parti-tion t different objects in exactly i nonempty sets.
∑
− = ⎟⎟⎠ − ⎞ ⎜⎜ ⎝ ⎛ − = 1 0 ) ( ) 1 ( ! 1 ) , ( i j t j j i j i i i t S (1)Formula (2) shows the number of ways to partition β different objects in more than two nonempty sets, and Table 3 illustrates how the number of possible learning patterns grows with β. S)(β) is the number of possible ways to learn a composite concept from β basic concepts.
∑ ∑ ∑ = − = = ⎟ ⎟ ⎠ ⎞ ⎜ ⎜ ⎝ ⎛ − ⎟⎟ ⎠ ⎞ ⎜⎜ ⎝ ⎛ − = =β β β β β 2 1 0 2 ! (1) ( ) 1 ) , ( ) ( i i j j i j i j i i i S S) (2)
The choice of the Q-matrix influences the prior distribution for the students being simulated, and is an important issue for studies that employ simulated students (VanLehn et al., 1998). There can be a myriad number of different Q-matrices, cf. (DiBello et al., 1995), and clearly the chosen Q-matrix af-fects the difficulty of identifying the learning pattern of interest. When there are β basic concepts in C) , there can be as many as n=2β-1 different concepts in C), and there can be as many as 2n different com-petence patterns, as we have explained in Example 1. In principle, a student can belong to any of these 2n patterns. Because each of these 2n patterns can be either included or not included in the Q-matrix, there are 22n different Q-matrices. Note that such quantities occur only in the worst-case scenario as not all of these 2n patterns and not all of the 2β-1 concepts are practical.
We can choose to include all possible competence patterns in a Q-matrix, or, alternatively, we can make the Q-matrix include only those patterns that appear to be helpful for identifying the learning patterns. In the former case, there is only one possible Q-matrix, but the size of this Q-matrix will be quite large. For β =3 and β =4, the Q-matrices will include, respectively, 128 and 32768 competence patterns. In the latter case, the selection of Q-matrices is equivalent to choosing a certain population of students to participate in our studies in order that we can achieve our goals. For instance, all of the values in the dABC columns of the Q-matrices in Table 1 are set to 1. As explained in Generating Stu-dent Records, such a setting makes the simulated stuStu-dents very likely to be able to integrate the parent concepts of dABC to learn dABC, and, if we want to learn how students learn dABC, it should be rea-sonable to recruit students who appear to be competent in dABC in our studies. Hence the choice for the settings of the dABC columns of the Q-matrices in Table 1 is not groundless. We will discuss the influence of Q-matrices in more detail in Influences of the Q-Matrices and More Realistic Evalua-tions when we present the experimental results.
Example 3. Based on this discussion, we choose to report results for interesting Q-matrices in which
there are only three or four basic concepts. For the C) used in Table 1, β=3 and n=7. There are four different ways to learn the composite concept dABC, 128(=27) different competence patterns, and 2128 possible Q-matrices, so there are 2130 (=4×2128) problem instances. For the problem in which we
con-Table 3. Results of computing Formula (2) grow exponentially with β
β 3 4 5 6 7 8 9 10
) (β
S) 4 14 51 202 876 4139 21146 115974
sider four basic concepts (i.e., β=4), there will be 14 different ways to learn dABCD based on Formula (2). A complete enumeration of the subsets of {A, B, C, D}, without considering the empty subset, in-cludes 15 configurations, which makes n =15 in C) (cf. Table 7). Hence, for this case, we have 32768(=215) competence patterns and 14×232768 different problem instances. ■
Impacts of Latent Variables
In addition to the large search space that was discussed in Computational Complexity, another major difficulty in learning the learning patterns comes from the fact we cannot directly observe the levels of competence of the students. What we have at hand are students’ responses to test items that are indi-rectly and probabilistically related to the actual competence levels. The literature, e.g., (Heckerman, 1999), has addressed common issues in learning network structures with hidden variables, and some, e.g., (Desmarais et al., 2006), have discussed issues that are specific to learning network structures for educational applications. In this subsection, we look into problems that are directly related to our tar-get problems.
If we could directly observe the states of competence levels of concepts, we would be able to ap-ply theoretical inference tools. Let CI(X, Y, Z) denote the situation that variables in X and Z become independent when we obtain information about the variables in Y. For simplicity, we say X and Z are conditionally independent given Y when CI(X, Y, Z) holds. (Note that X, Y, and Z may contain one or more variables.) Take the case for learning dABC as an example. If we can directly observe the states of group, cA, cB, cC, dAC, and dABC, we will find that CI(dAC, {group, cA, cB, cC}, dABC) if the actual structure is the network shown in Figure 1. We can tell whether the conditional independence holds based on the criteria for judging whether d-separation (Pearl, 1988) holds in Bayesian networks, and the data generated with this network are expected to reflect the independent relationship. Hence, we would be able to tell the learning pattern for dABC by checking whether CI(dAC, {group, cA, cB, cC}, dABC) and other relevant conditional independence relationships hold.
In reality, we cannot directly observe the states of group, cA, cB, cC, dAC, and dABC, and can observe only the states of the test items for cA, cB, cC, dAC, and dABC, i.e., the states of iAj, iBj, iCj, iACj, and iABCj, where j=1, 2, 3. This information is helpful but does not allow us to determine the answer to the problem for sure, because CI({iACj|j=1,2,3},{iAj, iBj, iCj|j=1,2,3}, {iABCj|j=1,2,3}) fails to hold for any structure shown in Figures 1 and 3 now. In Figure 1, even if we further assume the availability of information about group either because of students’ records or because of the help of student assessment software, nodes iACj and nodes iABCj, j=1,2,3, remain probabilistically dependent. In this network, only direct information about the competence levels, i.e., cA and cC, or either of dAC and dABC, can d-separate nodes iACj and nodes iABCj, j=1,2,3. As a consequence, if we can observe only the states of the nodes for test items, we cannot tell the difference among different ways of learn-ing dABC based on the concept of d-separation.
The research into learning Bayesian networks from data has made significant progress in recent years (Heckerman, 1999; Neapolitan, 2004). Yet, the problem of learning Bayesian networks with hidden variables is relatively more difficult. Based on our limited knowledge, existing algorithms can tackle problem instances that consider a limited number of hidden variables but such algorithms do not explicitly attempt to learn the relationships among a set of hidden variables, which is the focus of this paper. In addition to the consideration of hidden variables, a further major technical challenge in learn-ing Bayesian networks is misslearn-ing values in some of the trainlearn-ing data. We disregard this consideration at this moment, though it is possible for a real student not to answer all the questions in a test. We