DISCUSSION

National Chengchi University (國立政治大學)

Chapter 5 DISCUSSION

Chapter 4 analyzed our study using user behavior statistics, expert ratings of posts, and questionnaires. We discuss the results in the same order.

Section 5.1 discusses the effectiveness analysis. Section 5.2 discusses the analysis from an efficiency perspective. Section 5.3 discusses the circumstance perspective. After that, we discuss the posts rated by experts in Section 5.4 and the satisfaction questionnaires in Section 5.5.

5.1 Effectiveness perspective

The effectiveness analysis involves three hypotheses and three measurements. In this section, we discuss the length of a post, the number of medical-related features, and the existence of a description, in that order.

Our analysis shows that the model has little effect on the length of a post. This suggests that length is mainly driven by an asker’s background and situation (e.g., the severity of the illness in the real world, or the information given in the supporting documents in our user study), rather than by whether an RS is used or which type of RS is used. Posts on professional Q&A forums are usually long in the real world, but direct observation and interviews showed that our participants paid little attention to the length of a post, because the whole experiment was a simulation. Not all participants could imagine the situation successfully.

They sometimes had no idea what to write after reading the supporting introduction and documents, as shown by several collected posts that contain only one or two sentences. Of course, other participants were familiar with medical topics, or at least good at raising questions and discussions, which makes several collected posts worth investigating. In addition, the illness tested also affects the length of a post. The results show that foodborne illness posts are shorter than allergy and flu posts. We think this is because most participants are not familiar with foodborne illness, so when they run out of ideas while formulating a post, the post becomes short.

Next, the analysis of the number of medical-related features shows a main effect of model, which means that using an RS produces more medical-related features than not using one. However, contrary to our expectation, the semantic model performed better than the word embedding model. We had expected word embedding to provide more relevant topics, because it is based on common wording, whereas the semantic corpus comes from a dictionary and the recommendations it sends back to askers are synonyms of the features. We found, however, that when the word embedding model is not robust, the semantic model can perform better. In particular, askers who know many medical terms, or who reuse medical terms from the supporting documents, can trigger the semantic model to output more synonyms of medical features to choose from.

Having a description in a post is important for online experts evaluating an asker’s situation, because they cannot examine a person the way a clinic doctor can. A single question ending in a question mark is usually not enough for experts to solve a problem. Our data show main effects of model and of illness on the existence of a description. A closer look shows that posts written with word embedding are less likely to include a description than baseline posts. We think this indicates that people are still most familiar with posting without any interference, whereas our system takes considerable effort to use. In addition, because the scenario is only a simulation, most participants lack a strong intention to find solutions. They may be more comfortable writing complete posts in a low-stress condition (e.g., the baseline model).

From the viewpoint of illness, foodborne illness posts are more likely to contain more information than allergy and flu posts. This is reasonable: allergy and flu are familiar to most participants, who can summarize these two illnesses efficiently, while the unfamiliar one is written up in detail.

This section asks whether topic (feature) recommendation is useful when askers formulate posts on online Q&A forums. Online Q&A forums commonly provide mechanisms for searching existing questions, but we have not found a supporting system that focuses on the act of posting during the asking process. Although the result for the word embedding model did not match our expectation, two of the three measurements showed significant differences between having an RS and not having one. This suggests that helping askers engage more in the asking process, and feature recommendation itself, are worth developing.

5.2 Efficiency perspective

From an efficiency perspective, we want to know which model, word embedding or semantic, performs better when formulating posts. We therefore set two measurements: the adoption of recommended features and the number of medical-related features found.


First, we counted the total number of adoptions: 43 for word embedding and 31 for semantic. However, each model has 54 posts, and adoptions are not evenly distributed across posts; many posts did not select any recommended feature. Not surprisingly, the calculation failed to converge to an optimized result.

The main issue is that our system is still a prototype, and we need more resources to build a complete word embedding model. Even though the average score for “want to use this kind of topic RS someday” is 3.81/5.00 in our post-questionnaire, recommendations that look poor will worsen the user experience and make it hard to attract attention.
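To make the adoption counts easier to read, the short sketch below works through the arithmetic using only the figures reported above (43 and 31 adoptions, 54 posts per model). The function names are hypothetical and chosen for illustration. Because every post that adopts a feature accounts for at least one adoption, the totals also give a lower bound on how many posts adopted nothing.

```python
# Figures taken from the text: total adoptions per model and posts per model.
POSTS_PER_MODEL = 54
ADOPTIONS = {"word embedding": 43, "semantic": 31}


def mean_adoptions_per_post(total_adoptions: int, n_posts: int) -> float:
    """Average number of adopted features per post."""
    if n_posts <= 0:
        raise ValueError("n_posts must be positive")
    return total_adoptions / n_posts


def min_posts_without_adoption(total_adoptions: int, n_posts: int) -> int:
    """Lower bound on posts with zero adoptions.

    At most `total_adoptions` posts can contain an adoption (one each),
    so at least n_posts - total_adoptions posts contain none.
    """
    return max(0, n_posts - total_adoptions)


for model, total in ADOPTIONS.items():
    mean = mean_adoptions_per_post(total, POSTS_PER_MODEL)
    empty = min_posts_without_adoption(total, POSTS_PER_MODEL)
    print(f"{model}: {mean:.2f} adoptions per post, "
          f"at least {empty} posts without any adoption")
```

Both averages fall below one adoption per post, and the true number of posts without adoptions is higher than the bound whenever some posts adopt several features. This is consistent with the uneven distribution described above.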

The number of medical-related features is the other measure we analyzed. The result shows main effects: word embedding has a significant negative effect, while allergy has a significant positive effect. The reason the semantic model outperforms word embedding was mostly discussed in 5.1. The semantic corpus, built by manually processing WordNet, performs well when askers know how to phrase queries more professionally. Our word embedding model, trained by machine, still sometimes produces ambiguous recommended features. The only way to make it more useful is to collect a larger and more suitable data set and train it again.

This section asks whether the word embedding model is more useful than the semantic model. The current version of the word embedding model is clearly not good enough. There is little evidence that either the semantic or the word embedding model handles the posting situation well; we can only confirm that adopting an RS is better than not adopting one.

5.3 Circumstance perspective

In this part, we examine whether a general task and a particular task take different amounts of time with and without an RS. The results show a significant positive difference between word embedding and baseline on the daily scenario task, and likewise on the class discussion task. Semantic and baseline also differ significantly in time spent on the daily scenario task. The likely reason is that we encouraged every participant to try our RSs during the experiment, so most of them ran the RS at least once, which made the time spent longer. By contrast, the baseline could be completed as fast as a participant wanted, so its completion time could be significantly shorter. The only exception in our user study is the semantic RS in a particular and unfamiliar situation. From our observations, the semantic RS can be useful if an asker is familiar with medical conditions, but of little use if an asker has no idea what to ask, in which case the post may be finished either slowly or quickly.

Time spent is hard to interpret at this point, because two opposite readings are possible: with an RS, an asker may need more time to absorb the recommendations, or may formulate a post faster by referring to them. In our case, we only want to know whether time spent differs significantly across tasks with an RS versus the baseline. Most measurements in 5.3 support the third hypothesis.

Although all of the significant effects correspond to more time spent, spending longer to formulate a post is not a bad thing. If askers write complete, clearly formulated posts, using recommended features from an RS to enrich the details of their problem, experts and answerers on online Q&A forums may be able to help them faster.

5.4 Subjective adjudgment

Section 5.4 addresses the second research question: “Do questions supported by the posting recommender attract experts to answer?” Because the Kappa agreement between the two experts is low, we cannot average their ratings, so we discuss them separately. To relate the data collected from participants to the experts’ ratings, we ran the GEE controlling for model, illness, and the existence of a description (a measure from 4.2). If having a description shows a significant positive difference on the rating standards, we can conclude that adding a description to a post makes it more likely to earn good points from experts.
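The agreement check mentioned above can be illustrated with a minimal sketch. It assumes Cohen's kappa for two raters (the text says only "Kappa"), computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the rating lists are hypothetical and are not the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(ratings_a: Sequence[Hashable], ratings_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who rated the same items."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(ratings_a)

    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 ratings of eight posts by two experts (illustration only).
expert_1 = [5, 4, 3, 4, 2, 5, 3, 4]
expert_2 = [4, 4, 2, 5, 2, 3, 3, 5]
print(f"kappa = {cohen_kappa(expert_1, expert_2):.2f}")
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance. When the value is low, averaging the two raters' scores would blend judgments that do not measure the same thing, which is why the two experts are discussed separately here.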

In the ratings from the expert holding a pharmacist certificate, the willingness and complete standards show a main effect of the existence of a description. Posts labeled “False” are less likely to earn points from this expert. For the complete standard, the pairwise comparison between “False” and “True” descriptions shows that “False” posts are less likely to earn better points. In the ratings from the expert holding a professional medical qualification, all three standards show a main effect of the existence of a description. Posts labeled “False” are less likely to earn higher clear points from this expert, and the pairwise comparison shows lower points on willingness and complete when a post lacks a description. Earning good points means that experts can understand the post quickly, without having to ask for details several times. Posts that experts comprehend well then have a chance of being solved.

However, these results are not enough to explain the connection between a posting recommender and higher points from experts. Because posts with a description earned better points, we further ran pairwise comparisons on the interaction between model and “True” description for the three rating standards. According to the first expert’s ratings, complete points can be higher when a posting recommender is used together with a description. In the second expert’s ratings, both complete and clear points can increase if an asker adopts the posting recommender and adds more details to a post. Although the outcome may depend on the expert, we conclude that using an RS to gather information can make a post more likely to attract experts to answer. In addition, we found that the willingness standard is not significantly affected by a posting recommender with a description. This is because experts feel some responsibility to respond to what patients suffer from and seldom refuse requests. Willingness may therefore not be a good standard.

5.5 Discussion of questionnaire

We summarized the post-questionnaire results in 4.4. Participants scored every question above half of the five available points. The usefulness score of the word embedding method is higher than that of the semantic method, even though the semantic method performed better in our efficiency analysis. Besides the quality of the posting recommenders, the incomplete recording of user behavior is another reason for this mismatch: most participants are not used to clicking the copy button to show that they accept a new idea. An eye-tracking tool may therefore be useful for collecting implicit data in a future study. Further, the system process is understandable, and the willingness to use feature recommendation in the real world is not low (3.81, the second highest in Table 18).

However, we still have to improve how recommendations are presented (3.33, the second lowest in Table 18). Currently, each recommended feature shows at most 10 candidates in the display table. If the feature extractor returns 10 reasonable topics, up to 100 words appear on the webpage. Although these include useful features and ideas, such a complex presentation may interfere with users’ judgment, for example by making them lose the patience to browse all the recommended words.


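Source: Posting Recommendation Mechanism for Online Forums: A Case Study of Healthcare Q&A Websites (線上論壇提問推薦機制：以醫療照護問答網站為例), NCCU Academic Hub (政大學術集成), pp. 52-57.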
