The Cascading Forum Topic Mining Algorithm for Self-Organized Ontology Maintenance

Chapter 5 Behavior modeling of programming inquiry activity on the learning forum

5.3 The Cascading Forum Topic Mining Algorithm for Self-Organized Ontology Maintenance

Given the initial version of the Purpose-based Ontology provided by domain experts, topic discovery can be conducted on the forum. Since forum documents are inserted incrementally, the Self-Organized Ontology Maintenance Scheme is proposed. As shown in Figure 5.3, the scheme consists of five processes: initial purpose editing, purpose classification, topic clustering, ontology updating, and new document inserting.

Figure 5.3 The self-organized ontology maintenance scheme

Firstly, in the initial purpose editing process, the domain expert specifies the purposes s/he wants to analyze in the purpose layer of the ontology.

Secondly, the training data supports the construction of the classifier for purpose classification. Thirdly, in the topic clustering process, if the Topic Density of a topic is lower than the threshold, the number of documents associated with the topic is too large, so the clustering process is applied to reorganize the original topic into sub-topics. Fourthly, the ontology updating process revises the original ontology into a new version. Finally, the new document inserting process keeps creating new nodes for the inserted forum documents and associating them with the ontology. It periodically checks the Topic Density of the ontology and, after a number of documents have been inserted, applies topic clustering to self-organize the topics of the ontology. Thus, the Purpose-based Ontology can be periodically enhanced and maintained. The Self-Organized Ontology Maintenance Algorithm is as follows.

Algorithm : The Self-Organized Ontology Maintenance Algorithm

Step 1. In the initial Purpose Editing process, the domain experts provide the initial inquiry purposes of learning domain to classify the documents.

Step 2. Use the classified documents as the initial training set to perform the purpose classification analysis.

Step 3. For each topic, if the Topic Density (TD) value is larger than the threshold, which means that too many documents are classified into one topic, the Topic Clustering process is triggered to cluster the documents into groups of sub-topics.

Step 4. In the Ontology Updating process, the original ontology is updated with the new clustered sub-topics.

Step 5. When a new document is inserted, classify the document into a specific purpose and insert it into the most similar topic in the Purpose-based Ontology. If the Topic Density of the topic node receiving the document is acceptable, then stop; otherwise go to Step 3 and update the ontology.
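The control flow of Step 5 (insert, check, and re-cluster when needed) can be sketched as follows. This is only an illustrative sketch: the ontology is assumed to be a nested dictionary, and `classify`, `most_similar_topic`, `needs_split` and `split` are hypothetical callables standing in for the purpose classifier, the topic matcher, the Topic Density check and the Topic Clustering process, none of which are specified at this level of detail here.

```python
def insert_document(ontology, doc, classify, most_similar_topic, needs_split, split):
    """Insert one document, then re-cluster its topic if the density check fails.

    ontology: {purpose: {topic_name: [documents]}} (assumed representation).
    All four callables are hypothetical placeholders supplied by the caller.
    """
    purpose = classify(doc)                      # Step 5: predict the purpose
    topics = ontology.setdefault(purpose, {})
    topic = most_similar_topic(topics, doc)      # Step 5: pick the closest topic
    topics.setdefault(topic, []).append(doc)
    if needs_split(topics[topic]):               # Step 3: Topic Density check
        sub_topics = split(topics.pop(topic))    # Step 3: cluster into sub-topics
        for index, docs in enumerate(sub_topics, start=1):
            topics[f"{topic}-{index}"] = docs    # Step 4: update the ontology
    return ontology
```

Passing the density check in as a predicate keeps the sketch independent of how the Topic Density and its threshold are actually computed.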

In this section, the cascading forum topic mining algorithm is proposed to discover the hot topics.

1) Purpose Classification

Generally speaking, forum documents are composed of a title and a content body. The title may contain question words that represent the purpose of the question; the content body may contain either detailed question descriptions or answer phrases corresponding to the purpose of the question. According to research on question analysis [75], most question patterns can be represented as “question word + domain keywords”, where the question word is one of the interrogatives (What, How, Why, etc.) and the domain keywords are the keywords in the subsequent chunks that tend to reflect the intended answer more precisely. Therefore, with the manually constructed initial Purpose-based Ontology and the training data, question patterns are extracted for purpose classification using question analysis and document structure information.

• Question pattern

Documents with different purposes usually exhibit different question patterns, such as different interrogatives and various adjective terms. These interrogatives and adjectives can be formed into question patterns.

• Answer pattern

Besides the question pattern, the purpose of a document may also be predicted from its answer patterns.

To classify forum documents into purposes, lexical pattern matching at different structure levels is used. For example, the title keyword set can explicitly express the document purpose through question phrases starting with interrogatives such as “what”, “how”, “why”, etc. Example patterns for the purposes “what's the meaning”, “what's wrong”, “how to do” and “how to use” are shown in Table 5.2.

Table 5.2 The question and answer keyword-pattern examples of different purposes

Structure Level            Purpose              Keyword-pattern
Question-pattern example   What's the meaning   what is, meaning, definition, …
                           What's wrong         bug, error, problem, help, correct, why, can't, …
                           What's Difference    comparison, difference, the relation of, …
                           How to do            how to, how to implement, functionality, can use, …
                           How to use           use, call function, …
Answer-pattern example     What's the meaning   define as, used for, refer to, …
                           What's wrong         because, maybe, …
                           How to do            can use, call function, …
                           How to use           parameter is, for example, …

Therefore, at each level, the defined patterns are represented as a Boolean feature vector for the classification algorithms [49][55][57][62]: if a defined pattern appears in the document, the corresponding feature value is set to 1. With the defined features and the training set, the purpose classifier can be constructed as shown in Figure 5.4.

Since a group of discussions is usually posted in the forum under the same title, the Purpose Classifier by Content is applied first to identify the question or answer patterns in the content. It stops if the document can be classified; otherwise the Purpose Classifier by Title is applied for further classification using the terms in the title.

Figure 5.4. The structure-based purpose classifier
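The Boolean pattern features and the content-then-title cascade can be illustrated with a minimal sketch. It assumes plain substring matching and uses a "most matched patterns" rule as a stand-in for the trained classifier; the pattern lists are a subset of Table 5.2, and all function names are hypothetical.

```python
# Keyword patterns taken from a subset of Table 5.2 (illustrative only).
QUESTION_PATTERNS = {
    "What's the meaning": ["what is", "meaning", "definition"],
    "What's wrong": ["bug", "error", "problem", "why", "can't"],
    "How to do": ["how to", "how to implement", "functionality"],
}
ANSWER_PATTERNS = {
    "What's the meaning": ["define as", "used for", "refer to"],
    "What's wrong": ["because", "maybe"],
    "How to do": ["can use", "call function"],
}


def boolean_features(text, patterns):
    """Return {purpose: [0/1, ...]}; a feature is 1 if its pattern occurs in text."""
    lowered = text.lower()
    return {
        purpose: [1 if keyword in lowered else 0 for keyword in keywords]
        for purpose, keywords in patterns.items()
    }


def best_purpose(features):
    """Stand-in for a trained classifier: the purpose with the most matches."""
    scores = {purpose: sum(values) for purpose, values in features.items()}
    top = max(scores, key=scores.get)
    return top if scores[top] > 0 else None  # None means "cannot be classified"


def classify(title, content):
    """Apply the classifier by content first, then fall back to the title."""
    content_patterns = {
        purpose: QUESTION_PATTERNS[purpose] + ANSWER_PATTERNS[purpose]
        for purpose in QUESTION_PATTERNS
    }
    purpose = best_purpose(boolean_features(content, content_patterns))
    if purpose is None:
        purpose = best_purpose(boolean_features(title, QUESTION_PATTERNS))
    return purpose
```

In practice the Boolean vectors would be fed to a classifier trained on the labeled documents; the counting rule here only makes the cascade easy to follow.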

2) The Multi-Level Document Distance

In this dissertation, we apply the Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme [4][26][68][76] to represent the topics of documents.

Each document can be represented by a vector <tf_1×idf_1, tf_2×idf_2, …, tf_n×idf_n>, where tf_i is the frequency of the i-th term in the document, idf_i = log(N/df_i) is the Inverse Document Frequency (IDF) of the i-th term, N is the total number of documents, and df_i is the number of documents that contain the i-th term.
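As a small illustration of this weighting, the sketch below builds TF-IDF vectors from tokenized documents. It assumes raw term counts for tf and the natural logarithm for idf (the base of the logarithm is not stated above); `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, one TF-IDF vector per document)."""
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocabulary])
    return vocabulary, vectors
```

A term that occurs in every document receives idf = log(1) = 0, so it contributes nothing to the vector, which is the intended effect of the IDF factor.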

To calculate the semantic distance between document issues, the C++ Domain Keyword Ontology is used. The keywords can be collected from the indexes of textbooks and from online documents. The categories of the domain keywords include “platform”, “algorithm”, “Program Statement”, “Bug description”, “GUI”, etc. The leaves of the concepts are the keyword sets that describe each concept. For example, the concept “API” has the sub-concepts “DLL”, “LIB”, etc., as shown in Figure 5.5.

Figure 5.5. C++ domain ontology

To make it convenient to create relationships among concepts according to the ontology structure, we assume that every subclass of the C++ Domain Ontology has the same depth. In general, however, the depths of the concept structures differ. Therefore, in the C++ Domain Ontology, if a leaf concept is shallower than the desired depth, a Virtual Node (VN) is repeatedly inserted as its child node until the difference from the desired depth has been filled.
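The Virtual Node padding can be sketched as below, assuming the ontology is stored as nested dictionaries in which an empty dictionary marks a leaf and the desired depth is the depth of the deepest branch. The representation and the function names are assumptions made for illustration.

```python
def depth(node):
    """Depth of a subtree stored as nested dicts; a leaf ({}) has depth 0."""
    return 0 if not node else 1 + max(depth(child) for child in node.values())


def pad(node, remaining):
    """Chain Virtual Nodes below shallow leaves until `remaining` levels are filled."""
    if remaining == 0:
        return
    if not node:              # a leaf that is still too shallow
        node["VN"] = {}       # insert one Virtual Node as its child
    for child in node.values():
        pad(child, remaining - 1)


def equalize_depth(ontology):
    """Pad every branch so that all leaves sit at the depth of the deepest one."""
    pad(ontology, depth(ontology))
    return ontology
```

For instance, with the sub-concepts of "API" one level deeper than a childless "GUI" concept, a single "VN" child is added under "GUI" so that both branches reach the same depth.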

Accordingly, the semantic distance between two documents can be calculated as the weighted sum of the ontology distances from the bottom level to the root level. The bottom level has the highest weight, and the higher the level, the lower its weight.

With the ontology structure described above, let the depth of the domain ontology be h, and let the i-th element of a document's keyword vector at level l be denoted by U_i^(l). The distance calculated by the weighted sum of multi-level distances is defined as follows.

Definition 5.9 The weighted sum of Multi-Level Document Distance (MLDD). Let U^(l) and V^(l) be the keyword vectors of two documents at level l. The MLDD of the two documents is the weighted sum, over the levels of the ontology, of the Euclidean distance between U^(l) and V^(l).

For example, consider three documents with original keyword vectors Doc_a = <1, 0, 0>, Doc_b = <0, 1, 0> and Doc_c = <0, 0, 1>. With the definition of the weighted sum of multi-level document distance, the distances among the documents are MLDD(Doc_a, Doc_b) = √2 and MLDD(Doc_a, Doc_c) = √2 + 0.8×√2. As shown in Figure 5.6, although the distances among the original keyword vectors are the same, documents within the same class of concepts tend to be more similar.

Figure 5.6 The example of documents distance
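A minimal sketch of a weighted multi-level distance is given below. It rests on several assumptions that are not spelled out in the text: each upper-level vector is obtained by summing the values of the child keywords under each concept, the bottom level has weight 1.0 and the next level 0.8, and the first two keywords share one parent concept while the third belongs to another. The names `lift` and `mldd` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lift(vector, parent_of, n_parents):
    """Aggregate a vector to its parent level by summing child values (assumption)."""
    upper = [0.0] * n_parents
    for index, value in enumerate(vector):
        upper[parent_of[index]] += value
    return upper


def mldd(u, v, parent_maps, weights):
    """Weighted sum of per-level Euclidean distances, bottom level first.

    parent_maps: one (parent_of, n_parents) pair per step up the ontology.
    weights: one weight per level, starting with the bottom level.
    """
    total = weights[0] * euclidean(u, v)
    for (parent_of, n_parents), weight in zip(parent_maps, weights[1:]):
        u, v = lift(u, parent_of, n_parents), lift(v, parent_of, n_parents)
        total += weight * euclidean(u, v)
    return total
```

Under these assumptions, calling `mldd` with `parent_maps = [([0, 0, 1], 2)]` and `weights = [1.0, 0.8]` gives √2 for <1, 0, 0> versus <0, 1, 0> (same parent concept, so the upper-level distance is zero) and √2 + 0.8×√2 for <1, 0, 0> versus <0, 0, 1>.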

3) Cascading Topic Clustering Algorithm

With the defined Purpose-based Ontology, the cascading clustering algorithm is applied for topic discovery. Firstly, the Purpose-based Ontology is consulted to classify the documents into the predefined purposes by their interrogative patterns. Secondly, for each purpose, the topics are discovered by clustering analysis using the MLDD distance measurement on the issue vectors. Since the number of topics is unknown in advance and only a criterion on the required average document distance within each topic can be set, the ISODATA clustering algorithm [6], which adaptively splits and merges clusters to find the most suitable number of clusters for the data distribution, is applied. The cascading topic clustering algorithm is proposed as follows.

Algorithm: The cascading topic clustering

Input: Keyword vectors of forum documents, Purpose-based Ontology
Output: Clustering results

Step 1. Predict the purposes of the forum documents, such as what, how, why, others, etc.

Step 2. For each purpose, retrieve the concerned concepts set of this purpose from the Purpose-based Ontology.

Step 3. For all documents in this purpose, apply the ISODATA clustering algorithm with the weighted sum of Multi-Level Document Distance (MLDD).

Step 5. Store the clustering results into the associated purpose subclass of Purpose-based Ontology.

Step 6. If there still exists an un-clustered purpose, then go to Step 2 for next purpose.

Step 7. Output and save the clustering result as topics.

Example 5.3 Example of the cascading data mining for topic clustering

Assume that there are 10 documents with keywords Σ = {k1, k2, k3, k4, k5, k6, k7, k8, k9, k10, k11, k12, k13}.

Table 5.3 The keyword vector of documents

k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13

Doc1 1 1 1 1 1 1 1 1 1 0 0 0 1

Doc2 1 1 0 1 1 1 1 1 1 1 0 1 0

Doc3 1 1 0 1 1 1 0 0 0 0 0 1 0

Doc4 1 1 1 1 1 1 1 1 1 1 1 0 1

Doc5 1 1 1 1 1 1 1 0 1 1 1 0 1

Doc6 1 0 0 1 1 1 1 1 1 1 0 1 0

Doc7 0 1 1 0 1 1 0 1 0 0 0 0 0

Doc8 0 1 1 0 1 1 0 1 0 0 0 0 0

Doc9 1 1 0 1 1 1 0 0 0 0 0 1 0

Doc10 1 1 1 1 1 1 0 1 0 1 1 0 1

These documents are classified first, and the classification results can be stored in Table 5.4.

Table 5.4 The result of applying structure-based classification

Purpose Label         Doc
What's the meaning    {Doc1, Doc4, Doc5, Doc7, Doc8, Doc10}
What's wrong          {Doc2, Doc3, Doc6, Doc9}

Next, for each purpose, the clustering analysis is applied, and the clustering results can be stored in the data fields shown in Table 5.5.

Table 5.5 The Result of Applying ISODATA Clustering Algorithm

Purpose Label         Cluster   Doc                          Cluster Center
What's the meaning    C1-1      {Doc1, Doc4, Doc5, Doc10}    <1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1>
                      C1-2      {Doc7, Doc8}                 <0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0>
What's wrong          C2-1      {Doc3, Doc9}                 <1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0>
                      C2-2      {Doc2, Doc6}                 <1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0>
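The cascade of Example 5.3 (classify first, then cluster within each purpose) can be sketched with the data of Tables 5.3 and 5.4. Note the substitutions: a simple incremental centroid clustering with a distance threshold is used here in place of ISODATA, plain Euclidean distance is used in place of MLDD, and the threshold value 1.8 is an arbitrary choice made for this illustration.

```python
import math

# Keyword vectors from Table 5.3.
DOCS = {
    "Doc1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1],
    "Doc2": [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    "Doc3": [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    "Doc4": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "Doc5": [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1],
    "Doc6": [1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0],
    "Doc7": [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "Doc8": [0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "Doc9": [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0],
    "Doc10": [1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
}
# Purpose labels from Table 5.4.
PURPOSES = {
    "What's the meaning": ["Doc1", "Doc4", "Doc5", "Doc7", "Doc8", "Doc10"],
    "What's wrong": ["Doc2", "Doc3", "Doc6", "Doc9"],
}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(names):
    """Component-wise mean of the vectors of the named documents."""
    vectors = [DOCS[name] for name in names]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster(names, max_distance):
    """Incremental centroid clustering (a simple stand-in, not ISODATA)."""
    clusters = []
    for name in names:
        nearest = min(
            clusters,
            key=lambda c: euclidean(DOCS[name], centroid(c)),
            default=None,
        )
        if nearest is not None and euclidean(DOCS[name], centroid(nearest)) <= max_distance:
            nearest.append(name)      # close enough: join the nearest cluster
        else:
            clusters.append([name])   # otherwise start a new cluster
    return clusters


def cascade(purposes, max_distance):
    """Cluster the documents of each purpose separately."""
    return {purpose: cluster(names, max_distance) for purpose, names in purposes.items()}
```

With `cascade(PURPOSES, 1.8)`, this stand-in groups the documents into the same four sets as Table 5.5, although it numbers the two "What's wrong" clusters in the opposite order.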

Here we introduce the hot topics about “Object-Oriented Programming (OOP)”. Figure 5.7 presents the number of documents for each purpose in the topic “OOP”. As we can see, “How to do” is the most frequently discussed purpose in the forum documents, and “What's the difference” is the second.

Figure 5.7 The purposes of the hot topic about “OOP”

Further analysis of the purpose “How to do” shows that issues about the “constructor”, “the initialization and release of objects”, etc. are discussed frequently. In the purpose “What's Difference”, issues about “C and C++”, “structure and class”, etc. are discussed frequently. The remaining main issues discussed in each purpose are shown in Table 5.6. With this purpose-wide hot topic analysis, the inquiry topics of learners can be shown.

Table 5.6 The issues of different purposes discussed in the hot topic “OOP”

Hot Topic: Object-Oriented Programming (OOP)

Purpose              Issues
What's Mean          Concept about template and data member revision;
                     Concept about static object
What's Wrong         Problem about free and delete from memory;
                     Why can't it pass-by-reference
What's Difference    Conflict about dynamic class creation and overloading;
                     Differences between structure in C and Class in C++;
                     Difference between define and typedef;
                     Difference between WaitEvent and SignalEvent;
                     Difference between iterator and [] of STL
How to Do            How to use constructor in Class;
                     How to initiate the array in the constructor;
                     How to delete the object created from overloading
How to Use           How to connect mysql DB with C++;
                     How to compile the class in another directory;
                     How to use winsock.h in dev c++;
                     Can I use API in C

5.4 Evaluation

To stimulate problem-solving activities in the community, a Web 2.0 social network service with trustworthy expert finding is proposed. As shown in Figure 5.8, when a questioner posts a question, the main keywords of the question are first identified through interaction with the questioner. The expert finding service then finds trustworthy experts based on their topic interests with respect to the posted question.

Next, the questioner can configure the parameters to change the priority of the recommendation to fit the required trustworthiness and availability. Trustworthiness means that the experts have topic interests related to the posted question and have a good reputation based on their portfolio on the forum. Availability means that the experts are still present and have kept visiting the forum in recent months.

Thus, with the list of recommended experts, the system can actively organize the social network from the questioner to these experts by inviting them to help solve the posted question on the forum.

Figure 5.8 The trustworthy expert finding service to bridge the social network for problem solving

Since the trustworthiness of the service is based on the posted forum documents, it may result in the phenomenon of “the more discussions you post, the more your social network can be explored”. Thus, the service on the forum can facilitate the collective intelligence and the social network of Web 2.0 to enrich programming problem solving in the learning community.

The expert's profile, including topic interest, trustworthiness, and presence, is defined as follows. An example is shown in Figure 5.9.

• Topic interest: with its length given by the number of concepts in the issue layer of the PCO, the topic interest is a vector of Boolean values in which the k-th element is set to 1 if the expert has previously posted documents related to the k-th issue.

• Trustworthy value: a vector with the same length as the topic interest vector, representing the reputation of the expert in each specific topic. The k-th element of the trustworthy value is the ratio of the number of satisfied questioners to the number of all questioners with respect to the expert's historical replies. The larger the value, the more trustworthy the expert is.

• Presence value: an array that records, for each month, the ratio of the number of online days to the number of all days in that month. M1 is the ratio for the last month, M2 is the ratio for two months ago, etc.

Figure 5.9 An example of expert’s profile representation

With the defined expert profile, the aim of the expert finding service is to retrieve the relevant experts whose profiles are related to the posted question. This can be formulated with the objective indicators as follows.

A question Q is input by a questioner to express his/her programming problem as a concept weight vector. When a learner inputs a sentence describing the question, the predefined thesaurus is applied to extract the frequently used keywords. Thus, the question is transformed into a keyword vector, where the length of Q is limited to the number of issues in the PCO. Next, the weight values, 0 (not related), 0.5 (partially related), or 1 (highly related), can be adjusted by the questioner to represent the degree to which the question relates to each issue. In general, keywords of similar meaning are recognized as the same concept. Since the documents in the forum are short sentences, the concept weight vector can, according to our experiment, be limited to fewer than 50 keywords.

Question
Issue     V1    V2    V3    …    Vn
weight    0.5   1     0.5   …    0

Figure 5.10 Keyword vector of posted question

1) The trustworthy expert finding

In order to determine the degree of relevance between a query and an expert, the indicators of the objective function are defined. Assume that we are given a query Q and an expert E. Let E.Interest represent the interest vector and let E.Trust represent the trustworthy vector of the expert's profile. Here, an objective function Obj for measuring the correlation between query and expert is proposed by combining the objective functions Obj_Trust and Obj_Available.

• Trustworthiness: The correlations of the query vector Q with the vectors E.Interest and E.Trust are first calculated by the inner products Q·E.Interest and Q·E.Trust, each of which represents the similarity of the two vectors. The trustworthiness value is then measured by the weighted sum of the two inner products, with the factor α controlling the relative importance of topic interest and trustworthiness. The objective function of trustworthiness is defined in Equation 1.

Obj_Trust(Q, E) = α × (Q·E.Interest) + (1-α) × (Q·E.Trust) (1)

where the factor α, 0 ≤ α ≤ 1, is used to control the importance weighting between topic interest and trustworthiness.

• Availability: To reduce the problem of asynchrony, experts who are currently present can be invited to join the problem-solving discussion with higher priority. Thus, the availability parameter is included in the objective function. Availability is judged by the number of login records within a period of time, and its objective function is measured by the weighted average of the presence records in the expert's profile. Assume there are N records in the presence array and E.M_i represents the i-th element of the array; the objective function is defined in Equation 2.

Accordingly, the factor τ is proposed to represent the fading of the behavior influence, based on the probabilistic pheromone update of the Ant algorithm [28].

Obj_Available(E) = … (2)

Therefore, the ranges of the two objective terms, Obj_Trust and Obj_Available, are both [0, 1]. The objective measurement Obj for question Q and expert E, which is a linear combination of Obj_Trust and Obj_Available, is defined in Equation 3.

Obj(Q, E) = β × Obj_Trust(Q, E) + (1-β) × Obj_Available(E) (3)

where the factor β, 0 ≤ β ≤ 1, is used to control the weight between trustworthiness and presence.

Based on the definition of objective measurement function, there are several heuristic strategies for the questioner to choose.

• Trustworthy experts first (e.g. set α=0.2, β=1): Recommend the experts who are highly related to the question and have a high reputation to help solve the posted question. It can be used for difficult problem-solving topics, such as program debugging, how to implement a new application, etc.

• Similar topic interest experts first (e.g. set α=1, β=0.5): Recommend the experts who actively reply to related questions to help solve the posted question. It can be used for finding learning partners to discuss a topic, such as how to configure the development platform, how to use specific functions or modules, etc.

• Expert's availability first (e.g. set α=0.8, β=0.2): Recommend the active users to reply with their opinions. It can be used when quick feedback is needed, such as comparisons of different SDKs, opinion sharing about new technology, etc.
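To make the effect of α and β concrete, the sketch below evaluates Equations 1 and 3 for a single expert. It takes the availability score as a precomputed number in [0, 1] rather than implementing Equation 2, and it does not normalize the inner products; the example vectors are invented for illustration and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def obj_trust(q, interest, trust, alpha):
    """Equation 1: weighted sum of Q.E.Interest and Q.E.Trust."""
    return alpha * dot(q, interest) + (1 - alpha) * dot(q, trust)


def obj(q, interest, trust, availability, alpha, beta):
    """Equation 3; `availability` is a precomputed Obj_Available value (assumption)."""
    return beta * obj_trust(q, interest, trust, alpha) + (1 - beta) * availability


# Invented example: a question touching issues 1 and 3, and one expert profile.
question = [0.5, 0.0, 0.5]
expert_interest = [1, 0, 0]
expert_trust = [0.8, 0.0, 0.4]

trust_first = obj(question, expert_interest, expert_trust, 0.9, alpha=0.2, beta=1.0)
availability_first = obj(question, expert_interest, expert_trust, 0.9, alpha=0.8, beta=0.2)
```

With β = 1 the availability score has no influence on the result, while with β = 0.2 it dominates; ranking a set of experts by `obj` under each setting would yield the strategy-specific recommendation lists.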

2) The feasibility evaluation

• Training set for ontology construction

The data of the programming learning forum “Programmer-Club”, consisting of 14,183 forum documents and 1,734 user accounts, were collected from 2001 to 2007 as the test data. The characteristics of the forum test collection are listed in Table 5.7.

Table 5.7 Characteristics of the test forum documents database

Forum Name        No. of postings   No. of community members   Subject
Programmer-Club   14,183            1734                       C/C++ programming

• Sample questions

To compute the precision of the proposed approach for different questions, four frequently asked hot topics, namely “Q1: the object-oriented programming”, “Q2: the string processing”, “Q3: the array processing”, and “Q4: the loop statements”, are collected as sample questions.

• Expert finding service configurations

Three expert finding strategies with different configurations of parameter values are
