
7.1.1 Content Diversity

To further understand the quality of the generated content, we investigated the lexical diversity of the generated responses. A response with more distinct n-grams can be expected to have higher content diversity [9]. We illustrate the numbers of distinct unigrams, bigrams, and trigrams for the different models in Figure 7.1. As the figure shows, model-generated arguments have lower unigram diversity than the human arguments, but higher bigram and trigram diversity. The retrieved passages, on the other hand, have the highest content diversity of all the competitors. This is consistent with the expectation that news written by trained journalists tends to be of higher quality (e.g. in lexical diversity).

In terms of the effect of incorporating discussion history on content diversity, we found that the models leveraging discussion history information (i.e. Multi. and Multi.+Spk.) tend to have higher diversity than the single-turn model. The speaker embedding also increases the diversity of the generated counter-arguments.

Figure 7.1: Average number of distinct n-grams per argument.

Next, we illustrate the average type-token ratio (TTR) of the counter-arguments in Figure 7.2. As the figure shows, the models that generate longer counter-arguments (i.e. Multi. and Multi.+Spk.) can still maintain comparable TTRs.
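The two diversity measures used in this section, the number of distinct n-grams per argument and the type-token ratio, can be sketched as follows. This is a minimal illustration rather than the evaluation script used in this thesis; the function names and the whitespace tokenization are assumptions made for the example.

```python
from typing import List, Set, Tuple


def distinct_ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def type_token_ratio(tokens: List[str]) -> float:
    """Number of distinct tokens (types) divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def average_distinct(arguments: List[List[str]], n: int) -> float:
    """Average number of distinct n-grams per argument."""
    if not arguments:
        return 0.0
    return sum(len(distinct_ngrams(a, n)) for a in arguments) / len(arguments)


# Whitespace tokenization is an assumption made for this sketch.
sample = "it is not a perfect system . it is not a perfect solution .".split()
unigrams = len(distinct_ngrams(sample, 1))
bigrams = len(distinct_ngrams(sample, 2))
ttr = type_token_ratio(sample)
```

A longer response has more opportunities to repeat words, so the TTR tends to fall as length grows, which is why it is reported alongside the raw distinct n-gram counts.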

7.2 Human Evaluation

To understand how humans subjectively judge human-written and model-generated counter-arguments, we used Amazon Mechanical Turk (MTurk) to conduct a human evaluation. In this section, we first describe the annotation setup, including the guidance and the annotation interface presented to the annotators. We then discuss our findings from the results of the human evaluation.

Figure 7.2: Type-token ratio of different models.

7.2.1 Annotation Setup

We randomly picked 43 threads from the test set for human annotation. For each thread, the statement of the original poster and the corresponding comments in the discussion history are shown, and there are 15 Likert scales to be rated (3 aspects for each of the 5 candidate arguments). The order of the candidate responses in each thread is shuffled to avoid annotator bias. We hired three native English speakers to do the annotation. Each annotator was asked to read the annotation guidance shown in Figure 7.3 and then annotate all the threads. There are three aspects to be rated for each candidate:

• Appropriateness: Whether the response has the opposite stance as the original poster and has relevant content.

• Informativeness: Whether the response has many distinct talking points.

• Coherence: Whether the response is coherent with the discussion history (along with the responses).

An example of the annotation interface for a single thread is shown in Figure 7.4.

Read the discussion thread below and use the sliders to indicate how much you agree with the statements (1 = Strongly disagree, 5 = Strongly agree)

All threads in this task are from a subreddit named Change My View. And each thread has following contents:

1. Original post: A statement that expresses the posters viewpoint (thoughts, feelings, attitude or opinion) on a certain topic.

2. Responses: An ordered list of responses to the original post. The first item in this list is a comment that attempts to change the viewpoint of the original poster. Comments that follow after the first response are either more attempts to convince the original poster to change their viewpoint, or comments by the original poster that attempt to defend their viewpoint.

NOTE:

For each thread in CMV (Change My View), the original poster wants the community to change his/her opinion on a given topic. Thus, all the responses written by (Others) should take an opposite stance on the topic than the original poster. Responses denoted by (Original Poster) are written by the original poster themselves.

The candidates below should be seen as the responses written by Others, and thus should have the opposite stance as the original poster. Each candidate should be replying to the last response in the list (or directly to the Original Post if no other response is provided).

Each of the following candidates has 3 aspects to be rated.

Appropriateness: Whether the response has the opposite stance as the original poster and has relevant content.

Informativeness: Whether the response has many distinct talking points.

Coherence: Whether the response is coherent with the discussion history (along with the responses)

Figure 7.3: Annotation guidance for human evaluation.

7.2.2 Result

After collecting the annotation results, we found that some threads were relatively hard for the annotators to rate, resulting in low agreement scores. We therefore filtered out the threads whose overall Krippendorff’s alpha was lower than 0.1. The resulting evaluation results are listed in Table 7.2. The annotators achieve Krippendorff’s alpha values of 0.32, 0.37, and 0.35 for Appropriateness, Informativeness, and Coherence, respectively, implying a moderate agreement among the annotators.
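As a reference for how such an agreement score can be computed, the sketch below implements Krippendorff’s alpha as one minus the ratio of observed to expected disagreement. The text does not state which distance metric was used for the Likert ratings; the interval (squared-difference) metric here is an assumption, as are the function name and the input layout.

```python
from itertools import permutations
from typing import List, Optional, Sequence


def krippendorff_alpha_interval(
    ratings: Sequence[Sequence[Optional[float]]],
) -> float:
    """Krippendorff's alpha with the interval (squared-difference) metric.

    ratings[u] holds the scores given to unit u by each annotator;
    None marks a missing score. The interval metric is an assumption.
    """
    # Keep only units rated by at least two annotators (pairable values).
    units: List[List[float]] = []
    for row in ratings:
        values = [v for v in row if v is not None]
        if len(values) >= 2:
            units.append(values)

    pooled = [v for values in units for v in values]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of one unit.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(values, 2)) / (len(values) - 1)
        for values in units
    ) / n

    # Expected disagreement: squared differences over all pooled rating pairs.
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))

    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e
```

Under this sketch, a thread would be dropped when the alpha computed over its rated items falls below the 0.1 threshold mentioned above.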

Appro. Info. Coher.

Human 3.278 2.736 2.944

Retrieval 2.361 2.292 2.444

Single 1.444 1.208 1.583

Multi. 1.611 1.361 1.361

Multi.+Spk. 1.361 1.152 1.361

Table 7.2: Human evaluation results.

As the ground-truth counter-arguments, Human outperforms all the other systems, including the retrieved passages. The results also show that by incorporating the discussion history, the model can generate more appropriate and more informative content, albeit with lower coherence than the single-turn model. Interestingly, although the speaker embedding makes the multi-turn model perform well on most of the automatic evaluation metrics, it does not achieve better ratings in the human evaluation; its ratings are even lower than those of the single-turn model.

Thread 5

Original Post:

i ’ m a woman , a feminist and a huge political theory buff . i ’ ve struggled with gender all my life and i ’ m finally in a place where i can be the kind of woman i ’ d like to be . i don ’ t feel guilty about doing the things i enjoy , regardless of how they ’ re gendered , and i thought that this was a great victory ... until i literally got banned from r/feminism for saying this . apparently “ the banishment of gender is a core goal of the feminist movement ” now ? excuse me ... what the fuck ? am i a crazy person for telling these mods that their goals have become oppressive ? that it ’ s fine if gendered behavior isn ’ t mandatory , but it also can ’ t ethically be banned ? people enjoy most of the trappings of gender . obviously we have to eliminate , reform or reassess the ones that subjugate people ... but most of these behavioral patterns are harmless . is that really such a wild hot take that it deserves a ban ? i wasn ’ t even being angry ( fyi women are allowed to feel anger , but that ’ s a conversation for another time ) . what do you folks think ? can you help ? do you disagree ? do you have anything to add ?

Response 1:

(Others) this & gt ; “ the banishment of gender is a core goal of the feminist movement ” & amp ; this & gt ; am i a crazy person for telling these mods that their goals have become oppressive ? that it ’ s fine if gendered behavior isn ’ t mandatory , but it also can ’ t ethically be banned ? do n't really match . the idea of banishing gender is generally not the banning of cohering to current gender roles but removing the societal compulsion to follow these pressures and roles . can you link to the thread you got banned for so people can understand the context of the conversation ? it might also help clear up the difference between these ideas but if not could you comment on why you think these are not meaningfully different ideas ?

Response 2:

(Original poster) well , if anyone had had the foresight to say that , i very much doubt that this would have been a problem ! that would be the lucid and discerning way to phrase what they were saying . unfortunately , i ’ m banned from the thread , so i ’ m not sure how to link back to it anymore .

Candidate Responses:

1. even if banned you should be able to copy the permalink from your profile and pasting it here . it would give everyone valuable context i feel .

Appropriateness Informativeness Coherence

Figure 7.4: Example annotation interface for a single thread. The remaining 4 candidate responses are omitted for simplicity.

Chapter 8 Discussion

8.1 Effect of Speaker Embedding

We add a speaker embedding layer to the proposed model so that it can attend to the speakers along the debating history. To further investigate the effect of this layer, we conduct an additional experiment on our proposed model: we fix the speaker label to 0 (i.e. neither original poster nor others) for each token. The model then goes through the same generation process on our test data. Table 8.1 shows the automatic evaluation of the model with fixed speaker labels in comparison to the other models. As the table shows, if the speaker labels are fixed, the model cannot correctly identify the speakers in context, and its performance consequently drops compared to the model with correct speaker labels. However, thanks to the debating history information, the model still performs better than the single-turn model, which only has the information from the retrieved passages.
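The fixed-speaker setting can be pictured with the small sketch below. It is an illustration only: the speaker ids other than 0, the toy embedding tables, and the element-wise sum used to combine token and speaker vectors are all assumptions made for the example, not a description of the actual model code.

```python
from typing import List

# Speaker id 0 means "neither original poster nor others" (from the text);
# ids 1 and 2 are hypothetical choices for this sketch.
NEITHER, ORIGINAL_POSTER, OTHERS = 0, 1, 2


def embed(
    token_ids: List[int],
    speaker_ids: List[int],
    token_table: List[List[float]],
    speaker_table: List[List[float]],
) -> List[List[float]]:
    """Combine token and speaker vectors by element-wise sum (an assumption)."""
    if len(token_ids) != len(speaker_ids):
        raise ValueError("each token needs exactly one speaker label")
    return [
        [t + s for t, s in zip(token_table[tok], speaker_table[spk])]
        for tok, spk in zip(token_ids, speaker_ids)
    ]


def fix_speakers(speaker_ids: List[int]) -> List[int]:
    """Replace every speaker label with 0, as in the fixed-speaker experiment."""
    return [NEITHER] * len(speaker_ids)


# Toy embedding tables with two dimensions, for illustration only.
token_table = [[0.5, 0.25], [1.0, 2.0]]
speaker_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

tokens = [0, 1]
speakers = [ORIGINAL_POSTER, OTHERS]
with_speakers = embed(tokens, speakers, token_table, speaker_table)
fixed = embed(tokens, fix_speakers(speakers), token_table, speaker_table)
```

With the labels fixed, every token receives the same speaker vector, so the input no longer distinguishes who said what, while the debating history itself remains available to the model.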

BLEU-2 BLEU-2 (Multi.) ROUGE-L Length

Single 10.12 7.22 25.72 58.70

Multi. 10.73 7.71 26.91 65.60

Multi.+Spk. 10.62 7.75 27.10 64.28

Fixed Spk. 10.52 7.56 26.80 61.68

Table 8.1: Evaluation of model with fixed speaker embeddings.

8.2 Sample Generated Arguments

In this section, we show sample counter-arguments generated by the different models alongside the human-written counter-argument in Figure 8.1. As the sample shows, the multi-turn model generates an argument that starts with a few filler sentences (“I’m not sure what you’re trying to say ...”), then makes a claim (“It is not a despite regulation ...”) followed by its premise (“It has no bearing on ...”). This argument does take the opposite stance to the original poster, who thinks that net neutrality is a kind of coercion, and it also correctly recognizes that the thread is about a regulation issue.

Meanwhile, the single-turn model states the opposite stance but produces only off-topic sentences, and it keeps repeating “It’s not a perfect ...”. The model with speaker embedding, on the other hand, does not generate off-topic sentences, but it offers only general statements opposing the original poster.

As a result, the multi-turn model, which incorporates the debating history, generates more appropriate and longer responses. Although the speaker embedding helps the model obtain higher scores on automatic evaluation, this might just be a result of mimicking sentences or terms frequently used in human counter-arguments.

Original Post:

The expressed goal of net neutrality advocates is desirable. I am more in favor of net neutrality than opposed to it … However, my view is that net neutrality, like other regulation is unacceptable coercion, and is only beneficial as a result of other government coercion (like giving billions of tax money to these companies for fiber that amounts to nothing ) …

Comment 1 (User in CMV):

Are you suggesting we force people to use onion-type routing even if they want a faster more direct connection? …

Comment 2 (Original Poster):

I am not suggesting we force people to use onion type routing . I'm suggesting that it could be a technological solution to the problem of neutrality.

Human:

Most people will want to go onion-free to take advantage of the fastest hulu/netflix/youtube connections they can get. Increased or widespread usage will not fix those performance problems because the dedicated bandwidth and widespread caching are key to fast video delivery and can't work with onion routing.

Single-turn:

I don't think this is true. It's not a perfect system. But it's a net positive. It isn't a perfect solution.

Multi-turn:

I'm not sure what you're trying to say here. Are you arguing that it isn't a good idea? I don't think so.

It's not the same thing. It is not a despite regulation. It has no bearing on whether or not it should be regulated. I do think it's a matter of opinion. But that's not how it works in the first place.

Multi-turn + Spk.:

I don't think that's a good argument. I'm not sure what you're trying to say here. You're just trying to make a point. If you don't like it, then you're not going to be able to change your view.

Figure 8.1: An example of generated counter-arguments.

Chapter 9 Conclusion

This thesis is the first research to explore the task of argument generation with multi-turn debating history as context. We address a new problem of generating counter-arguments given a statement of viewpoint and its debating context. To train a generator for this task, we construct a large-scale dataset containing around 800k counter-arguments. Our experiments examine the performance of three different models. The baseline is the single-turn model proposed by Hua et al. [5], which only uses the information from the retrieved passages. The other two models incorporate the debating history context, and one of them has an additional speaker embedding that lets the model capture the speaker of each utterance. As the experimental results show, incorporating the debating history does help the models generate more appropriate arguments in terms of both automatic and human evaluation. We also notice that even though the speaker embedding helps the multi-turn model obtain higher scores in automatic evaluation, it might hurt the coherence of the counter-arguments.

During this research, we also identified several directions for improving the task, which we list as future work. First, the keyphrases in the keyphrase bank are extracted from the retrieved passages, as in prior research. Although they provide high-quality content, they are not directly related to the debating history. Also, because they are extracted from a fixed IR database, the diversity of the keyphrases is bounded by the coverage of the database. On the other hand, if we choose not to extract the keyphrases from the debating history, we could add a training loss over the passages to our model so that it learns how these phrases are used in the passages. Second, due to the particular nature of argument generation, we think that a tailor-made evaluation for counter-arguments is needed. For example, we can first identify the counter-argumentative components of the target counter-argument, then calculate the coverage of the identified components as an indicator of the quality of a generated counter-argument. Finally, although counter-argument generation is closely related to argument mining (AM), the research to date does not fully leverage AM techniques. For instance, before sentence planning and content realization, we can first follow the stages of AM (e.g. identifying argumentative components and finding relations among components).

Bibliography

[1] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.

[2] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.

[3] X. Hua, Z. Hu, and L. Wang. Argument generation with retrieval, planning, and realization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2661–2672, 2019.

[4] X. Hua and L. Wang. Neural argument generation augmented with externally retrieved evidence. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 219–230, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[5] X. Hua and L. Wang. Sentence-level content planning and style specification for neural text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 2019. Association for Computational Linguistics.

[6] B. Lavoie and O. Rainbow. A fast and portable realizer for text generation systems. In Fifth Conference on Applied Natural Language Processing, pages 265–268, Washington, DC, USA, Mar. 1997. Association for Computational Linguistics.

[7] D. T. Le, C.-T. Nguyen, and K. A. Nguyen. Dave the debater: a retrieval-based and generative argumentative dialogue agent. In Proceedings of the 5th Workshop on Argument Mining, pages 121–130, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics.

[8] R. Levy, B. Bogin, S. Gretz, R. Aharonov, and N. Slonim. Towards an argumentative content search engine using weak supervision. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2066–2081, Santa Fe, New Mexico, USA, Aug. 2018. Association for Computational Linguistics.

[9] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California, June 2016. Association for Computational Linguistics.

[10] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.

[11] J.-F. Lin, K. Y. Huang, H.-H. Huang, and H.-H. Chen. Lexicon guided attentive neural network model for argument mining. In Proceedings of the 6th Workshop on Argument Mining, pages 67–73, Florence, Italy, Aug. 2019. Association for Computational Linguistics.

[12] J. Lu, C. Zhang, Z. Xie, G. Ling, T. C. Zhou, and Z. Xu. Constructing interpretive spatio-temporal features for multi-turn responses selection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 44–50, Florence, Italy, July 2019. Association for Computational Linguistics.

[13] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA, 2002. Association for Computational Linguistics.

[14] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.

[15] G. Rakshit, K. K. Bowden, L. Reed, A. Misra, and M. Walker. Debbie, the debate bot of the future. arXiv preprint arXiv:1709.03167, 2017.

[16] C. Reed, D. Long, and M. Fox. An architecture for argumentative dialogue planning. In International Conference on Formal and Applied Practical Reasoning, pages 555–566. Springer, 1996.

[17] N. Reimers, B. Schiller, T. Beck, J. Daxenberger, C. Stab, and I. Gurevych. Classification and clustering of arguments with contextualized word embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 567–578, Florence, Italy, July 2019. Association for Computational Linguistics.

[18] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, 2017.

[19] I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models, 2016.

[20] X. Shen, H. Su, W. Li, and D. Klakow. NEXUS network: Connecting the preceding and the following in dialogue generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4316–4327, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics.

[21] H. Su, X. Shen, R. Zhang, F. Sun, P. Hu, C. Niu, and J. Zhou. Improving multi-turn dialogue modelling with utterance ReWriter. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 22–31, Florence, Italy, July 2019. Association for Computational Linguistics.

[22] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc., 2014.

[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
