
• N-gram Transformation

• Orthogonal Sparse Bigram (OSB) Transformation (p. 81)

• Lowercase Transformation (p. 81)

• Remove Punctuation Transformation (p. 82)

• Quantile Binning Transformation (p. 82)

• Normalization Transformation (p. 82)

• Cartesian Product Transformation (p. 83)

N-gram Transformation

The n-gram transformation takes a text variable as input and produces strings corresponding to sliding a window of (user-configurable) n words over the text, emitting one output string at each window position. For example, consider the text string "I really enjoyed reading this book".

Specifying the n-gram transformation with window size = 1 simply gives you all the individual words in that string:

{"I", "really", "enjoyed", "reading", "this", "book"}

Specifying the n-gram transformation with window size = 2 gives you all the two-word combinations as well as the one-word combinations:

{"I really", "really enjoyed", "enjoyed reading", "reading this", "this book", "I", "really", "enjoyed", "reading", "this", "book"}

Specifying the n-gram transformation with window size = 3 will add the three-word combinations to this list, yielding the following:

{"I really enjoyed", "really enjoyed reading", "enjoyed reading this",
"reading this book", "I really", "really enjoyed", "enjoyed reading",
"reading this", "this book", "I", "really", "enjoyed", "reading",
"this", "book"}

You can request n-grams with a size ranging from 2 to 10 words. N-grams with size 1 are generated implicitly for all inputs whose type is marked as text in the data schema, so you do not have to ask for them. Finally, keep in mind that n-grams are generated by breaking the input data on whitespace characters. That means that, for example, punctuation characters are considered part of the word tokens: generating n-grams with a window of 2 for the string "red, green, blue" yields {"red,", "green,", "blue", "red, green", "green, blue"}. You can use the punctuation remover processor (described later in this document) to remove the punctuation symbols if this is not what you want.

To compute n-grams of window size 3 for variable var1:

"ngram(var1, 3)"

Orthogonal Sparse Bigram (OSB) Transformation

The OSB transformation is intended to aid in text string analysis and is an alternative to the bigram transformation (n-gram with window size 2). OSBs are generated by sliding a window of size n over the text and outputting every pair of words that includes the first word in the window.

To build each OSB, its constituent words are joined by the "_" (underscore) character, and every skipped token is indicated by adding another underscore to the OSB. Thus, the OSB encodes not just the tokens seen within a window, but also the number of tokens skipped within that same window.

To illustrate, consider the string "The quick brown fox jumps over the lazy dog", and OSBs of size 4. The six four-word windows and the last two shorter windows from the end of the string are shown in the following example, along with the OSBs generated from each:

Window, {OSBs generated}

"The quick brown fox", {The_quick, The__brown, The___fox}

"quick brown fox jumps", {quick_brown, quick__fox, quick___jumps}

"brown fox jumps over", {brown_fox, brown__jumps, brown___over}

"fox jumps over the", {fox_jumps, fox__over, fox___the}

"jumps over the lazy", {jumps_over, jumps__the, jumps___lazy}

"over the lazy dog", {over_the, over__lazy, over___dog}

"the lazy dog", {the_lazy, the__dog}

"lazy dog", {lazy_dog}

Orthogonal sparse bigrams are an alternative to n-grams that might work better in some situations.

If your data has large text fields (10 or more words), experiment to see which works better. Note that what constitutes a large text field may vary depending on the situation. However, with larger text fields, OSBs have been empirically shown to uniquely represent the text due to the special skip symbol (the underscore).

You can request a window size of 2 to 10 for OSB transformations on input text variables.
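The window-and-skip construction described above can be sketched as follows. This is a minimal illustration under the assumption that tokens come from whitespace splitting; the function name osb mirrors the recipe keyword but the code is not Amazon ML's implementation.

```python
def osb(text, n):
    """Pair the first word of each window with every later word in that window.

    One underscore joins adjacent words; each skipped token adds one more.
    """
    tokens = text.split()
    result = []
    # The last window that still yields a pair starts at the second-to-last token.
    for start in range(len(tokens) - 1):
        window = tokens[start:start + n]
        first = window[0]
        for distance, other in enumerate(window[1:], start=1):
            result.append(first + "_" * distance + other)
    return result


for pair in osb("The quick brown fox jumps over the lazy dog", 4):
    print(pair)
```

For the nine-word example string and a window of 4, this yields the same 21 OSBs listed in the table above, including the shorter windows at the end of the string.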

To compute OSBs with window size 5 for variable var1:

"osb(var1, 5)"

Lowercase Transformation

The lowercase transformation processor converts text inputs to lowercase. For example, given the input "The Quick Brown Fox Jumps Over the Lazy Dog", the processor will output "the quick brown fox jumps over the lazy dog".

To apply the lowercase transformation to the variable var1:

"lowercase(var1)"

Remove Punctuation Transformation

Amazon ML implicitly splits inputs marked as text in the data schema on whitespace. Punctuation in the string ends up either adjoining word tokens or as separate tokens entirely, depending on the whitespace surrounding it. If this is undesirable, the punctuation remover transformation can be used to remove punctuation symbols from generated features. For example, given the string "Welcome to Amazon ML - please fasten your seat-belts!", the following set of tokens is implicitly generated:

{"Welcome", "to", "Amazon", "ML", "-", "please", "fasten", "your", "seat-belts!"}

Applying the punctuation remover processor to this string results in this set:

{"Welcome", "to", "Amazon", "ML", "please", "fasten", "your", "seat-belts"}

Note that only the prefix and suffix punctuation marks are removed. Punctuation that appears in the middle of a token, such as the hyphen in "seat-belts", is not removed.
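A rough Python sketch of this behavior follows. It assumes that "punctuation" means the ASCII characters in Python's string.punctuation, which may differ from the set Amazon ML actually uses, and the function name no_punct simply mirrors the recipe keyword.

```python
import string


def no_punct(text):
    """Strip leading and trailing punctuation from each whitespace token.

    Tokens that consist only of punctuation become empty and are dropped.
    """
    stripped = (token.strip(string.punctuation) for token in text.split())
    return [token for token in stripped if token]


print(no_punct("Welcome to Amazon ML - please fasten your seat-belts!"))
```

The standalone "-" disappears and the trailing "!" is removed, while the hyphen inside "seat-belts" survives because it is neither a prefix nor a suffix.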

To apply punctuation removal to the variable var1:

"no_punct(var1)"

Quantile Binning Transformation

The quantile binning processor takes two inputs, a numerical variable and a parameter called bin number, and outputs a categorical variable. The purpose is to discover non-linearity in the variable's distribution by grouping observed values together.

In many cases, the relationship between a numeric variable and the target is not linear (the numeric variable value does not increase or decrease monotonically with the target). In such cases, it might be useful to bin the numeric feature into a categorical feature representing different ranges of the numeric feature. Each categorical feature value (bin) can then be modeled as having its own linear relationship with the target. For example, let's say you know that the continuous numeric feature account_age is not linearly correlated with likelihood to purchase a book. You can bin account_age into categorical features that might be able to capture the relationship with the target more accurately.

The quantile binning processor can be used to instruct Amazon ML to establish n bins of equal size based on the distribution of all input values of the account_age variable, and then to substitute each number with a text token containing the bin. The optimum number of bins for a numeric variable depends on characteristics of the variable and its relationship to the target, and is best determined through experimentation. Amazon ML suggests the optimal bin number for a numeric feature based on data statistics in the Suggested Recipe.

You can request between 5 and 1000 quantile bins to be computed for any numeric input variable.
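The idea of equal-size bins derived from the value distribution can be sketched as follows. This is an assumption-laden illustration: the cut-point rule and the "bin_k" token format are hypothetical choices for the sketch, not Amazon ML's documented behavior.

```python
from bisect import bisect_right


def quantile_bin(values, n_bins):
    """Replace each number with a token naming its quantile bin.

    Cut points are taken from the sorted values so that each bin receives
    roughly the same number of observations.
    """
    ordered = sorted(values)
    # n_bins - 1 interior cut points, read off the sorted data.
    edges = [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]
    return [f"bin_{bisect_right(edges, value)}" for value in values]


ages = list(range(1, 101))
bins = quantile_bin(ages, 5)
print(bins[0], bins[50], bins[-1])
```

With 100 evenly spread values and 5 bins, each bin ends up holding 20 values; skewed data would produce bins of unequal width but similar counts.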

The following example shows how to compute and use 50 bins in place of the numeric variable var1:

"quantile_bin(var1, 50)"

Normalization Transformation

The normalization transformer normalizes numeric variables to have a mean of zero and variance of one.

Normalization of numeric variables can help the learning process if there are very large range differences between numeric variables, because variables with the highest magnitude could dominate the ML model regardless of whether the feature is informative with respect to the target.

To apply this transformation to numeric variable var1, add this to the recipe:

normalize(var1)

This transformer can also take a user-defined group of numeric variables or the predefined group for all numeric variables (ALL_NUMERIC) as input:

normalize(ALL_NUMERIC)

Note

It is not mandatory to use the normalization processor for numeric variables.
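As a minimal sketch of what "mean of zero and variance of one" means, the following Python applies a standard z-score. It assumes the population standard deviation is used; the source does not say whether Amazon ML uses the population or sample variance, and the function name normalize only mirrors the recipe keyword.

```python
from statistics import fmean, pstdev


def normalize(values):
    """Shift and scale values to mean 0 and (population) variance 1."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


print(normalize([10, 20, 30, 40]))
```

After this step, a variable measured in thousands and one measured in fractions end up on the same scale, which is the point made above about large range differences.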

Cartesian Product Transformation

The Cartesian transformation generates permutations of two or more text or categorical input variables.

This transformation is used when an interaction between variables is suspected. For example, consider the bank marketing dataset that is used in Tutorial: Using Amazon ML to Predict Responses to a Marketing Offer. Using this dataset, we would like to predict whether a person would respond positively to a bank promotion, based on economic and demographic information. We might suspect that the person's job type is somewhat important (perhaps there is a correlation between being employed in certain fields and having the money available), and that the highest level of education attained is also important. We might also have a deeper intuition that there is a strong signal in the interaction of these two variables; for example, that the promotion is particularly well suited to customers who are entrepreneurs with a university degree.

The Cartesian product transformation takes categorical variables or text as input, and produces new features that capture the interaction between these input variables. Specifically, for each training example, it combines the values of the input variables and adds the combination as a standalone feature. For example, let's say our simplified input rows look like this:

target, education, job

0, university.degree, technician

0, high.school, services

1, university.degree, admin

If we specify that the Cartesian transformation is to be applied to the categorical variables education and job, the resulting feature education_job_interaction will look like this:

target, education_job_interaction

0, university.degree_technician

0, high.school_services

1, university.degree_admin

The Cartesian transformation is even more powerful when it comes to working on sequences of tokens, as is the case when one of its arguments is a text variable that is implicitly or explicitly split into tokens.

For example, consider the task of classifying a book as being a textbook or not. Intuitively, we might think that there is something about the book's title that can tell us it is a textbook (certain words might occur more frequently in textbooks' titles), and we might also think that there is something about the book's binding that is predictive (textbooks are more likely to be hardcover), but it's really the combination of some words in the title and binding that is most predictive. For a real-world example, the following table shows the results of applying the Cartesian processor to the input variables binding and title:

Textbook, Title, Binding, Cartesian product of no_punct(Title) and Binding

1, Economics:

0, Fun With Problems, Softcover, {"Fun_Softcover", "With_Softcover", "Problems_Softcover"}

The following example shows how to apply the Cartesian transformer to var1 and var2:

cartesian(var1, var2)
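The combination step can be sketched in Python as below. This is an illustrative approximation: it assumes both inputs are split on whitespace (harmless for single-token categorical values) and that the pieces are joined with an underscore, as in the examples above. The function name cartesian mirrors the recipe keyword only.

```python
def cartesian(left, right):
    """Combine every token of `left` with every token of `right`."""
    left_tokens = left.split()
    right_tokens = right.split()
    return [f"{a}_{b}" for a in left_tokens for b in right_tokens]


# Two categorical values produce a single interaction feature.
print(cartesian("university.degree", "technician"))

# A text value contributes one feature per token.
print(cartesian("Fun With Problems", "Softcover"))
```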

Data Rearrangement

The data rearrangement functionality enables you to create a datasource that is based on only a portion of the input data that it points to. For example, when you create an ML Model using the Create ML Model wizard in the Amazon ML console, and choose the default evaluation option, Amazon ML automatically reserves 30% of your data for ML model evaluation, and uses the other 70% for training.

This functionality is enabled by the Data Rearrangement feature of Amazon ML.

If you are using the Amazon ML API to create datasources, you can specify which part of the input data a new datasource will be based on. You do this by passing instructions in the DataRearrangement parameter to the CreateDataSourceFromS3, CreateDataSourceFromRedshift, or CreateDataSourceFromRDS APIs. The contents of the DataRearrangement string are a JSON string containing the beginning and end locations of your data, expressed as percentages, a complement flag, and a splitting strategy. For example, the following DataRearrangement string specifies that the first 70% of the data will be used to create the datasource:

{"splitting": {"percentBegin": 0, "percentEnd": 70}}

DataRearrangement Parameters

To change how Amazon ML creates a datasource, use the following parameters.

PercentBegin (Optional)

Use percentBegin to indicate where the data for the datasource starts. If you do not include percentBegin and percentEnd, Amazon ML includes all of the data when creating the datasource.

Valid values are 0 to 100, inclusive.

PercentEnd (Optional)

Use percentEnd to indicate where the data for the datasource ends. If you do not include percentBegin and percentEnd, Amazon ML includes all of the data when creating the datasource.

Valid values are 0 to 100, inclusive.

Complement (Optional)

The complement parameter tells Amazon ML to use the data that is not included in the range of percentBegin to percentEnd to create a datasource. The complement parameter is useful if you need to create complementary datasources for training and evaluation. To create a complementary datasource, use the same values for percentBegin and percentEnd, along with the complement parameter.

For example, two datasources defined this way do not share any data and can be used to train and evaluate a model: one that covers 25 percent of the data, and its complement, which covers the remaining 75 percent.

Valid values are true and false.
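The relationship between a range and its complement can be sketched in Python. The helper names rearrangement and select_sequential are hypothetical, the 0 to 25 range is an assumed example, and the boundary rounding is a guess; only the JSON keys (splitting, percentBegin, percentEnd, complement, strategy) come from this guide.

```python
import json


def rearrangement(percent_begin, percent_end, complement=False):
    """Build a DataRearrangement JSON string (keys as used in this guide)."""
    splitting = {"percentBegin": percent_begin, "percentEnd": percent_end}
    if complement:
        # The examples in this guide pass the flag as the string "true".
        splitting["complement"] = "true"
    return json.dumps({"splitting": splitting})


def select_sequential(records, percent_begin, percent_end, complement=False):
    """Simulate which records a sequential split would keep (assumed rounding)."""
    low = len(records) * percent_begin // 100
    high = len(records) * percent_end // 100
    if complement:
        return records[:low] + records[high:]
    return records[low:high]


records = list(range(100))
first = select_sequential(records, 0, 25)
second = select_sequential(records, 0, 25, complement=True)
print(rearrangement(0, 25), len(first))
print(rearrangement(0, 25, complement=True), len(second))
```

Using identical percentBegin and percentEnd values in both strings is what guarantees that the two datasources partition the data without overlap.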

Strategy (Optional)

To change how Amazon ML splits the data for a datasource, use the strategy parameter.

The default value for the strategy parameter is sequential, meaning that Amazon ML takes all of the data records between the percentBegin and percentEnd parameters for the datasource, in the order that the records appear in the input data.

The following two DataRearrangement lines are examples of sequentially ordered training and evaluation datasources:

Datasource for evaluation: {"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"sequential"}}

Datasource for training: {"splitting":{"percentBegin":70, "percentEnd":100, "strategy":"sequential", "complement":"true"}}

To create a datasource from a random selection of the data, set the strategy parameter to random and provide a string that is used as the seed value for the random data splitting (for example, you can use the S3 path to your data as the random seed string). If you choose the random split strategy, Amazon ML assigns each row of data a pseudo-random number, and then selects the rows that have an assigned number between percentBegin and percentEnd. Pseudo-random numbers are assigned using the byte offset as a seed, so changing the data results in a different split. Any existing ordering is preserved. The random splitting strategy ensures that variables in the training and evaluation data are distributed similarly. It is useful in cases where the input data may have an implicit sort order, which would otherwise result in training and evaluation datasources containing dissimilar data records.

The following two DataRearrangement lines are examples of non-sequentially ordered training and evaluation datasources:

Datasource for evaluation:

{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100,
    "strategy": "random",
    "strategyParams": {
      "randomSeed": "RANDOMSEED"
    }
  }
}

Datasource for training:

{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100,
    "strategy": "random",
    "strategyParams": {
      "randomSeed": "RANDOMSEED"
    },
    "complement": "true"
  }
}

Valid values are sequential and random.
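The random strategy can be approximated with the sketch below. It is not Amazon ML's algorithm: SHA-256 hashing and the use of a row index in place of the byte offset are stand-in assumptions, and the function names are hypothetical. What it does illustrate is the described mechanism: each row gets a repeatable pseudo-random number, rows whose number falls between percentBegin and percentEnd are selected, and the original ordering is preserved.

```python
import hashlib


def row_percent(seed, row_index):
    """Map (seed, row) to a repeatable pseudo-random number in [0, 100)."""
    digest = hashlib.sha256(f"{seed}:{row_index}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10000) / 100


def select_random(records, percent_begin, percent_end, seed, complement=False):
    """Keep rows whose assigned number is in range (or out of it, if complement)."""
    chosen = []
    for index, record in enumerate(records):
        in_range = percent_begin <= row_percent(seed, index) < percent_end
        if in_range != complement:
            chosen.append(record)
    return chosen


records = list(range(1000))
evaluation = select_random(records, 70, 100, "RANDOMSEED")
training = select_random(records, 70, 100, "RANDOMSEED", complement=True)
print(len(evaluation), len(training))
```

Because both calls use the same seed and range, the training and evaluation sets never overlap, and rerunning with the same seed reproduces the same split.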

(Optional) Strategy:RandomSeed

Amazon ML uses the randomSeed to split the data. The default seed for the API is an empty string.

To specify a seed for the random split strategy, pass in a string. For more information about random seeds, see Randomly Splitting Your Data (p. 45) in the Amazon Machine Learning Developer Guide.

For sample code that demonstrates how to use cross-validation with Amazon ML, go to GitHub Machine Learning Samples.


Evaluating ML Models

You should always evaluate a model to determine if it will do a good job of predicting the target on new and future data. Because future instances have unknown target values, you need to check the accuracy metric of the ML model on data for which you already know the target answer, and use this assessment as a proxy for predictive accuracy on future data.

To properly evaluate a model, you hold out a sample of data that has been labeled with the target (ground truth) from the training datasource. Evaluating the predictive accuracy of an ML model with the same data that was used for training is not useful, because it rewards models that can "remember" the training data, as opposed to generalizing from it. Once you have finished training the ML model, you send the model the held-out observations for which you know the target values. You then compare the predictions returned by the ML model against the known target value. Finally, you compute a summary metric that tells you how well the predicted and true values match.
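The final comparison step can be illustrated with a very small sketch. Plain accuracy is used here only as an example of a summary metric; it is an assumption for illustration, not the metric Amazon ML reports, and the function name is hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of held-out observations whose prediction matches the known target."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


held_out_targets = [1, 0, 0, 1]
model_predictions = [1, 0, 1, 1]
print(accuracy(model_predictions, held_out_targets))
```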

In Amazon ML, you evaluate an ML model by creating an evaluation. To create an evaluation for an ML model, you need an ML model that you want to evaluate, and you need labeled data that was not used for training. First, create a datasource for evaluation by creating an Amazon ML datasource with the held-out data. The data used in the evaluation must have the same schema as the data used in training and include actual values for the target variable.

If all your data is in a single file or directory, you can use the Amazon ML console to split the data. The default path in the Create ML model wizard splits the input datasource and uses the first 70% for a training datasource and the remaining 30% for an evaluation datasource. You can also customize the split ratio by using the Custom option in the Create ML model wizard, where you can choose to select a random 70% sample for training and use the remaining 30% for evaluation. To further specify custom split ratios, use the data rearrangement string in the Create Datasource API. Once you have an evaluation datasource and an ML model, you can create an evaluation and review the results of the evaluation.

Topics

• ML Model Insights (p. 87)

• Binary Model Insights (p. 88)

• Multiclass Model Insights (p. 91)

• Regression Model Insights (p. 92)

• Preventing Overfitting (p. 94)

• Cross-Validation (p. 95)

• Evaluation Alerts (p. 96)

ML Model Insights

When you evaluate an ML model, Amazon ML provides an industry-standard metric and a number of insights to review the predictive accuracy of your model. In Amazon ML, the outcome of an evaluation contains the following:

• A prediction accuracy metric to report on the overall success of the model

• Visualizations to help explore the accuracy of your model beyond the prediction accuracy metric

• The ability to review the impact of setting a score threshold (only for binary classification)

• Alerts on criteria to check the validity of the evaluation

The choice of the metric and visualization depends on the type of ML model that you are evaluating. It is important to review these visualizations to decide if your model is performing well enough to match your business requirements.

Binary Model Insights

Interpreting the Predictions

The actual output of many binary classification algorithms is a prediction score. The score indicates the system's certainty that the given observation belongs to the positive class (the actual target value is 1). Binary classification models in Amazon ML output a score that ranges from 0 to 1. As a consumer of this score, to make the decision about whether the observation should be classified as 1 or 0, you interpret the score by picking a classification threshold, or cut-off, and compare the score against it. Any observations with scores higher than the cut-off are predicted as target = 1, and scores lower than the cut-off are predicted as target = 0.

In Amazon ML, the default score cut-off is 0.5. You can choose to update this cut-off to match your business needs. You can use the visualizations in the console to understand how the choice of cut-off will affect your application.
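Applying a cut-off to prediction scores can be sketched as follows. The function name classify is hypothetical, and the handling of a score exactly equal to the cut-off (predicted as 0 here) is an assumption, since the text only covers scores strictly higher or lower.

```python
def classify(scores, cutoff=0.5):
    """Turn prediction scores in [0, 1] into 0/1 labels using a cut-off."""
    # Assumption: a score exactly at the cut-off is treated as 0.
    return [1 if score > cutoff else 0 for score in scores]


scores = [0.2, 0.5, 0.7]
print(classify(scores))              # default cut-off of 0.5
print(classify(scores, cutoff=0.1))  # a lower cut-off predicts more positives
```

Lowering the cut-off trades more positive predictions for more false positives, which is the effect the console visualizations help you examine.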

Measuring ML Model Accuracy

Amazon ML provides an industry-standard accuracy metric for binary classification models called Area Under the (Receiver Operating Characteristic) Curve (AUC). AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold.
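One way to see why AUC needs no cut-off is to compute it directly from its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses that standard definition, counting ties as half; it is a quadratic-time illustration, not how Amazon ML computes the metric.

```python
def auc(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

In this example three of the four positive/negative pairs are ordered correctly, so the result is 0.75 regardless of any threshold.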

The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is
