Two-Stage Stochastic Natural Language Generation for Email Synthesis by Modeling Sender Style and Topic Structure
Yun-Nung (Vivian) Chen and Alexander I. Rudnicky
Motivation
o Generate emails that reflect sender style and intent of communication
o Provide emails as part of synthetic evidence of insider threats for purposes of training, prototyping, and evaluating anomaly detectors.
Approach
o Senders’ characteristics are modeled based on their writing patterns (structure, politeness, etc.) instead of their attitudes
o 1st Stage: modeling sender style and topic structure for email organization
o 2nd Stage: stochastic generation of language for surface realization
2. The Framework
• We propose a two-stage stochastic NLG process for email synthesis that models sender style and topic structure.
• Subjects can detect sender style and can differentiate template-based (sentence-level) sentences from stochastically generated (word-level) sentences.
• This technique can be used to create realistic emails, and email generation could be carried out using mixtures containing additional models based on other characteristics.
• The current study shows that email can be synthesized using a small corpus of labeled data; however, these models could be used to bootstrap the labeling of a larger corpus, which in turn could be used to create more robust models.
3. Training Data Preprocessing
The ratio of subjects’ preference according to different criteria
• Each email can be treated as a structural label sequence
1st stage structures the emails according to sender style and topic structure (high-level generation)
2nd stage synthesizes text content based on the particulars of an email element and the goals of a given communication (surface-level realization).
• Evaluation of Sender Style Modeling
o Rate synthesized emails for each sender on a scale of 1 (highly confident that email is not from the sender) to 5 (highly confident that email is from the sender)
o Average normalized score received by the corresponding (true) sender: 45%, compared with a 33% chance level for 3 senders
• Evaluation of Surface Realization
o Compare template-based generation (sentence-level NLG) and stochastic generation (word-level NLG) on the same email structures.
Different senders tend to structure emails in different ways.
The word-based stochastic generation outperforms the template-based algorithm and requires less effort in terms of knowledge engineering.
[Figure: framework overview. Input to NLG: sender and topic. Training data preprocessing: the email document archive receives structural label annotation (yielding structural label sequences) and slot annotation (yielding emails with slots). 1st Stage, Modeling Sender Style and Topic Structure for Email Organization: build structure LMs (sender-specific and topic-specific models), predict mixture models, and generate structural label sequences such as <greeting> <inform> … 2nd Stage, Surface Realization: build content LMs, generate text content, score email candidates, and fill slots from slot-value pairs, e.g. "Hi [Person] Today’s ..." becomes the synthesized email "Hi Peter Today’s ...".]
1. The Task
4. Modeling Sender Style and Topic Structure for Email Organization
From: Kitchen, Louise
Sent: Thursday, April 05, 2001 11:15 AM
To: Beck, Sally
Subject: Re: Costs
Shukaly resigned and left.
But I assume the invitation will be extended to all of their groups so that whoever they want can attend.
I would actually prefer that the presentation is actually circulated to the groups on Friday rather than presented as we will wait forever on getting an offsite together.
How about circulating the presentation and then letting them refer all questions to Rahil - see how much interest you get.
One on ones are much better and I think this is how Rahil should proceed.
We need to get in front of customers in the next couple of weeks.
Let's aim to get a least three customers this quarter.
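Louise
[Figure annotations on the example email: the message is divided into header and content; content segments are labeled inform, suggestion, request, and signature.]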
• Structural Label Annotation
o 10 email structure elements (greeting, inform, request, suggestion, question, answer, regard, ack., sorry, sign)
• Slot Annotation
o General class: 7 classes extracted by Named Entity Recognition (location, person, org., time, money, percent, date)
o Topic class: 3 classes extracted by keywords (meeting, issue, discussion)
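As an illustration of the keyword-based part of slot annotation, the following is a minimal Python sketch. The keyword lists and the function name `annotate_topic_slots` are hypothetical (the poster does not list the actual keywords), and the NER-based general classes are not covered here.

```python
import re

# Hypothetical keyword lists per topic-class slot; the actual lists are not given.
TOPIC_KEYWORDS = {
    "meeting": ["speech seminar", "offsite"],
    "issue": ["budget problem"],
    "discussion": ["design review"],
}

def annotate_topic_slots(sentence):
    """Replace keyword matches with [slot] tokens.

    Returns the slotted sentence and the extracted (slot, value) pairs.
    """
    pairs = []
    template = sentence
    for slot, keywords in TOPIC_KEYWORDS.items():
        for kw in keywords:
            pattern = re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
            match = pattern.search(template)
            if match:
                pairs.append((slot, match.group(0)))
                template = pattern.sub("[" + slot + "]", template, count=1)
    return template, pairs

template, pairs = annotate_topic_slots("Tomorrow's speech seminar is at Gates building.")
print(template)
print(pairs)
```

The slotted sentence is what a content model would be trained on, while the slot-value pairs are kept for filling slots later.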
For each structural label:
1) Building Structure Language Models
o Sender-specific structure LM (trigram w/ smoothing)
o Topic-specific structure LM (trigram w/ smoothing)
2) Predicting Mixture Models
3) Stochastically Generating Email Structures
o Generate structural label sequences by sampling randomly from the distribution of the mixture model
[Figure: the sender-specific model and the topic-specific model are combined into a mixture model.]
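The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: add-alpha smoothing, a fixed linear interpolation weight, and the toy label sequences are all hypothetical choices, since the poster does not specify the smoothing method or how the mixture weights are predicted.

```python
import random
from collections import Counter, defaultdict

LABELS = ["greeting", "inform", "request", "suggestion", "question",
          "answer", "regard", "ack", "sorry", "sign"]
BOS, EOS = "<s>", "</s>"
VOCAB = LABELS + [EOS]

def train_trigram(sequences, alpha=0.1):
    """Trigram LM over structural labels with add-alpha smoothing (assumed)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [BOS, BOS] + list(seq) + [EOS]
        for i in range(2, len(padded)):
            counts[(padded[i - 2], padded[i - 1])][padded[i]] += 1

    def prob(context):
        c = counts.get(context, Counter())
        total = sum(c.values()) + alpha * len(VOCAB)
        return {w: (c[w] + alpha) / total for w in VOCAB}

    return prob

def mixture(sender_lm, topic_lm, weight=0.5):
    """Linear interpolation of the two models; the weight is a fixed assumption."""
    def prob(context):
        ps, pt = sender_lm(context), topic_lm(context)
        return {w: weight * ps[w] + (1 - weight) * pt[w] for w in VOCAB}
    return prob

def sample_structure(lm, rng, max_len=12):
    """Sample a structural label sequence from the (mixture) model."""
    context, out = (BOS, BOS), []
    while len(out) < max_len:
        dist = lm(context)
        label = rng.choices(VOCAB, weights=[dist[w] for w in VOCAB])[0]
        if label == EOS:
            break
        out.append(label)
        context = (context[1], label)
    return out

# Hypothetical toy training data for one sender and one topic.
sender_seqs = [["greeting", "inform", "request", "sign"],
               ["greeting", "inform", "sign"]]
topic_seqs = [["inform", "suggestion", "request"],
              ["greeting", "question", "sign"]]
mix = mixture(train_trigram(sender_seqs), train_trigram(topic_seqs))
print(sample_structure(mix, random.Random(0)))
```

Because both component models are proper distributions over the same label vocabulary, any interpolation weight between 0 and 1 keeps the mixture normalized.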
5. Surface Realization
For each structural label:
• Build Content Language Model
o Cross-sender content LMs (5-gram w/o smoothing)
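An unsmoothed n-gram content LM and word-level sampling from it might look like the sketch below. The training sentences are hypothetical, and the demo uses `order=3` only so that the toy corpus can recombine; the poster's setting is a 5-gram model.

```python
import random
from collections import defaultdict

BOS, EOS = "<s>", "</s>"

def train_content_lm(sentences, order=5):
    """Unsmoothed n-gram table: context tuple -> list of observed next tokens.

    Repeated continuations are kept in the list, so uniform sampling from it
    follows the relative-frequency estimate.
    """
    table = defaultdict(list)
    for sent in sentences:
        tokens = [BOS] * (order - 1) + sent.split() + [EOS]
        for i in range(order - 1, len(tokens)):
            table[tuple(tokens[i - order + 1:i])].append(tokens[i])
    return table

def generate(table, order, rng, max_len=30):
    """Sample one sentence word by word until the end symbol or max_len."""
    context = (BOS,) * (order - 1)
    words = []
    while len(words) < max_len:
        nxt = rng.choice(table[context])
        if nxt == EOS:
            break
        words.append(nxt)
        context = (context + (nxt,))[1:]
    return " ".join(words)

# Hypothetical slotted sentences for the "inform" label.
inform_sentences = [
    "Tomorrow's [meeting] is at [location] .",
    "The [meeting] is at [time] .",
    "Today's [issue] is at the top of the list .",
]
lm = train_content_lm(inform_sentences, order=3)
print(generate(lm, 3, random.Random(0)))
```

Without smoothing, every context reached during sampling was seen in training, so generation never hits an empty continuation list. It also means the output can only recombine observed n-grams.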
For each generated structural label:
1) Stochastically Generate Text Content
2) Score Email Candidates
o We penalize the synthesized email if it:
  - contains slots without provided values
  - doesn’t have the required slots
3) Fill Slots
o Tomorrow’s [meeting] is at [location]. → Tomorrow’s speech seminar is at Gates building.
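The scoring and slot-filling steps can be sketched in Python as below. The equal penalty weights, the reading of "required slots" as the slots for which values were provided, and the candidate sentences are assumptions made for illustration only.

```python
import re

SLOT = re.compile(r"\[(\w+)\]")

def penalty(candidate, slot_values):
    """Count slots lacking a provided value plus provided slots never used."""
    used = set(SLOT.findall(candidate))
    provided = set(slot_values)
    return len(used - provided) + len(provided - used)

def fill_slots(candidate, slot_values):
    """Substitute each [slot] with its value; unknown slots are left as-is."""
    return SLOT.sub(lambda m: slot_values.get(m.group(1), m.group(0)), candidate)

# Hypothetical candidates produced by the content generation step.
candidates = [
    "Tomorrow's [meeting] is at [time].",
    "See you tomorrow.",
    "Tomorrow's [meeting] is at [location].",
]
slot_values = {"meeting": "speech seminar", "location": "Gates building"}

best = min(candidates, key=lambda c: penalty(c, slot_values))
print(fill_slots(best, slot_values))  # Tomorrow's speech seminar is at Gates building.
```

Here the first candidate is penalized for an unfilled [time] slot and a missing [location] slot, the second for missing both required slots, so the third is selected and filled.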
6. Experiments
(%)           Template   Stochastic   No Diff
Coherence        36.19        38.57     25.24
Fluency          28.10        40.48     31.43
Naturalness      35.71        45.71     18.57
Preference       36.67        42.86     20.48
Overall          34.17        41.90     23.93
A sender may have a personal style of structuring emails.
Emails about the same topic may have similar structures.
7. Conclusions
Subjects can recognize sender style from cues such as greeting usage, politeness, and email length.