

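7.4.3 Writing Custom Operators

The custom operator feature lets users write their own operators.

Click the "Add custom operator" icon to create and open a template operator, that is, an operator editor (equivalent to one cell of an IPython Notebook). Enter a name for the custom operator, then develop it in the new operator editor, as shown in Figure 7-44. The operator can be edited, debugged, copied, and saved.

Figure 7-44 Adding a custom operator

Note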

class MLSClassName:
    # core code for customized algorithm
    def run(self):
        # get upper output of workflow
        self.upper_output = self.inputs["upper_output"]
        # ...core code...
        # output format
        self._outputs = {
            "output_port_1": "output_result"
        }

    # user called method for getting algorithm result
    def get_outputs(self):
        return self._outputs

# call form for algorithm
inputs = {
    "upper_output": None  #@input {"type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "param_1": "param_value_1",  #@param {"label":"param_1","type":"string","required":"false","helpTip":""}
    "param_2": "param_value_2",  #@param {"label":"param_1","type":"enum","options":"one,two,three","required":"true","helpTip":""}
    "param_3": "param_value_3",  #@param {"label":"param_1","type":"integer","range":"(0,none)","required":"true","helpTip":""}
    "param_4": "param_value_4"  #@param {"label":"param_1","type":"number","range":"(0,1)","required":"true","helpTip":""}
}
mls_instance_#id# = MLSClassName(**params)
mls_instance_#id#.run()
#@output {"label":"dataframe","name":"mls_instance_#id#.get_outputs() ['output_port_1']","type":"DataFrame"}
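The template can be filled in along the following lines. This is a minimal sketch rather than the platform's reference implementation: the class name MLSScaleColumn, its parameters column and factor, and the __init__ method are hypothetical (the template as extracted shows no constructor), and a list of dictionaries stands in for the pyspark DataFrame so that the example runs without Spark.

```python
class MLSScaleColumn:
    """Hypothetical custom operator: multiplies one column by a factor."""

    def __init__(self, inputs, column, factor):
        # Assumption: `inputs` and every annotated parameter arrive as
        # keyword arguments, mirroring MLSClassName(**params) in the template.
        self.inputs = inputs
        self.column = column
        self.factor = factor
        self._outputs = {}

    def run(self):
        # Upstream output; a list of dicts stands in for a DataFrame here.
        rows = self.inputs["upper_output"]
        scaled = [{**row, self.column: row[self.column] * self.factor} for row in rows]
        # Same output layout as the template: one named output port.
        self._outputs = {"output_port_1": scaled}

    def get_outputs(self):
        return self._outputs


inputs = {"upper_output": [{"x": 1.0}, {"x": 2.5}]}
params = {"inputs": inputs, "column": "x", "factor": 2.0}
operator = MLSScaleColumn(**params)
operator.run()
result = operator.get_outputs()["output_port_1"]
```

The rows held in inputs are left untouched; run() builds new rows, so a downstream operator reading output_port_1 never sees a partially modified input.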

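Guidelines for input settings

As the code template shows, the output of the upstream operator is the input of this operator. It is passed to the class object through the inputs dictionary, which is how data flows from an upstream operator to the current operator in an operator chain.

The #@input marker triggers a response in the front-end UI that defines this operator's input port, so that it can be connected with the upstream ...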
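To make the annotation format concrete, the sketch below extracts the JSON metadata that follows each #@input, #@param, or #@output marker. How the platform's front end actually reads these markers is not described in the source; the regular expression and the function name parse_annotations are assumptions made purely for illustration.

```python
import json
import re

# Matches "#@input {...}", "# @param {...}" or "# @output {...}" at the end of a line.
ANNOTATION = re.compile(r"#\s*@(input|param|output)\s*(\{.*\})\s*$")


def parse_annotations(source):
    """Return (kind, metadata) pairs for every annotated line in `source`."""
    found = []
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if match:
            found.append((match.group(1), json.loads(match.group(2))))
    return found


template = '''
inputs = {
    "upper_output": None  #@input {"type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "param_1": "param_value_1",  #@param {"label":"param_1","type":"string","required":"false","helpTip":""}
}
'''
annotations = parse_annotations(template)
```

Because the metadata is ordinary JSON, the annotation has to stay on the same line as the entry it describes; a wrapped annotation would no longer be attached to its parameter.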

Guidelines for output settings

inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
}
box_plot____id___ = MLSBoxPlot(**params)
box_plot____id___.run()

inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

bucket_num | - | Number of buckets; defaults to 10

Sample

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}

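7.5.1.1.3 Correlation Analysis

Overview

Performs correlation analysis on the numeric columns of a dataset.

Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

Output

A dataset of statistical results.

Parameter description

Parameter | Sub-parameter | Description
selected_columns_str | - | Formatted string of the selected columns, which must be numeric, for example: "column_a" or "column_a,column_b"
method | - | Correlation method; "pearson" and "spearman" are supported

Sample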

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "selected_columns_str": "",  # @param {"label":"selected_columns_str","type":"string","required":"false","helpTip":""}
    "method": "pearson"  # @param {"label":"method","type":"enum","required":"true","options":"pearson,spearman","helpTip":""}
}
correlation_analysis____id___ = MLSCorrelationAnalysis(**params)
correlation_analysis____id___.run()
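The two values of method differ in what they measure. The following pure-Python sketch (not the operator's implementation, which runs on pyspark) computes both coefficients for two small columns: Pearson works on the raw values, while Spearman applies the same formula to their ranks. The function names and the sample data are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


column_a = [1.0, 2.0, 3.0, 4.0, 5.0]
column_b = [1.0, 4.0, 9.0, 16.0, 25.0]  # column_a squared: monotonic, not linear
pearson_ab = pearson(column_a, column_b)
spearman_ab = spearman(column_a, column_b)
```

For a monotonic but non-linear relationship such as this one, the Spearman coefficient reaches 1 while the Pearson coefficient stays below it, which is the usual reason to choose "spearman".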

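Input

Parameter | Sub-parameter | Description
inputs | dataframe | Required. The input dataset. If neither pipeline_model nor decision_tree_classify_model is supplied, a decision tree classifier is trained directly on the dataset to obtain the feature importances.
 | pipeline_model | Optional. If supplied, feature importance is computed from the upstream pyspark pipeline model object.
 | decision_tree_classify_model | Optional. If supplied, feature importance is computed from the upstream decision tree classification model object.

Output

A result dataset containing the feature importances.

Parameter description

Parameter | Sub-parameter | Description
input_columns_str | - | Formatted string of the dataset's feature column names, for example: "column_a" or "column_a,column_b"
label_col | - | Name of the target column
model_input_featu
prediction_col | - | Name of the column holding the prediction result when training the model; defaults to "prediction"
impurity | - | Criterion used to compute information gain; "gini" and "entropy" are supported

Sample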

inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "decision_tree_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label":"input_columns_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "model_input_features_col": "model_features",  # @param {"label":"model_input_features_col","type":"string","required":"false","helpTip":""}
    "classifier_label_index_col": "label_index",  # @param {"label":"classifier_label_index_col","type":"string","required":"false","helpTip":""}
    "prediction_index_col": "prediction_index",  # @param {"label":"prediction_index_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "gini"  # @param {"label":"impurity","type":"enum","required":"false","options":"entropy,gini","helpTip":""}
}
dt_classify_feature_importance____id___ = MLSDecisionTreeClassifierFeatureImportance(**params)
dt_classify_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_classify_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
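The impurity option selects how the quality of a split is scored. As a rough illustration, the sketch below computes Gini impurity and entropy for a list of class labels and the resulting information gain of a split. It is a textbook formulation in plain Python, with base-2 logarithms assumed for entropy; it is not taken from the operator's source, and the function names are hypothetical.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each distinct label."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits (base-2 logarithm is an assumption here)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - weighted


parent = ["a", "a", "b", "b"]
left, right = ["a", "a"], ["b", "b"]  # a perfect split
gain_gini = information_gain(parent, left, right, gini)
gain_entropy = information_gain(parent, left, right, entropy)
```

Both criteria rank this split as the best possible one; they differ mainly in scale and in how strongly they penalise mixed nodes.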

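Parameter | Sub-parameter | Description
 | pipeline_model | Optional. If supplied, feature importance is computed from the upstream pyspark pipeline model object.
 | decision_tree_r

input_columns_str | - | Formatted string of the dataset's feature column names, for example: "column_a" or "column_a,column_b"
label_col | - | Name of the target column
model_input_features_col | - | Name of the feature vector column
prediction_col | - | Name of the column holding the prediction result when training the model; defaults to "prediction"
max_depth | - | Maximum depth of the tree; defaults to 5
max_bins | - | Maximum number of bins used when splitting features; defaults to 32
min_instances_per_node | - | Minimum number of instances each node must contain when the decision tree splits; defaults to 1
min_info_gain | - | Minimum information gain; defaults to 0.0

Sample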

inputs = {
# ...
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "variance"
}
dt_regression_feature_importance____id___ = MLSDecisionTreeRegressorFeatureImportance(**params)
dt_regression_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_regression_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
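For regression trees the impurity is the variance of the target values, and min_instances_per_node and min_info_gain act as thresholds that a candidate split must pass. The sketch below shows one plausible reading of those two checks in plain Python. The function names are hypothetical, and the exact comparison used by the operator (for example whether a gain equal to the threshold is accepted) is an assumption.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_is_allowed(parent, left, right, min_instances_per_node=1, min_info_gain=0.0):
    """Check a candidate split against the two thresholds."""
    # Each child must keep at least min_instances_per_node rows.
    if len(left) < min_instances_per_node or len(right) < min_instances_per_node:
        return False
    # Gain = parent variance minus the size-weighted variance of the children.
    n = len(parent)
    weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
    gain = variance(parent) - weighted
    # Assumption: a gain equal to the threshold is still accepted.
    return gain >= min_info_gain


targets = [1.0, 1.0, 5.0, 5.0]
low, high = [1.0, 1.0], [5.0, 5.0]
accepted = split_is_allowed(targets, low, high)
rejected_by_gain = split_is_allowed(targets, low, high, min_info_gain=5.0)
rejected_by_size = split_is_allowed(targets, low, high, min_instances_per_node=3)
```

Raising either threshold prunes splits earlier, which yields a shallower tree and, typically, fewer features with non-zero importance.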

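inputs | dataframe | Required. The input dataset. If neither pipeline_model nor gbt_classify_model is supplied, a GBDT classification model is trained directly on the dataset to obtain the feature importances.
 | pipeline_model | Optional. If supplied, feature importance is computed from the upstream pyspark pipeline model object pipeline_model.

Parameter | Sub-parameter | Description
classifier_label_index_col | - | Name of the column holding the label-encoded target column; defaults to "label_index"
prediction_index_col | - | Name of the column holding the predicted label index when training the model; defaults to "prediction_index"
prediction_col | - | Name of the column holding the prediction result when training the model; defaults to "prediction"
max_depth | - | Maximum depth of the tree
max_bins | - | Maximum number of bins used when splitting features
min_instances_per_node | - | Minimum number of instances each node must contain when the tree splits; defaults to 1
min_info_gain | - | Minimum information gain
max_iter | - | Maximum number of iterations
step_size | - | Step size
subsampling_rate | - | Fraction of the training set sampled to train each tree

Sample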

inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false",
    # ...
    "subsampling_rate": 1.0  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
}
gbt_classifier_feature_importance____id___ = MLSGBTClassifierFeatureImportance(**params)
gbt_classifier_feature_importance____id___.run()
# @output {"label":"dataframe","name":"gbt_classifier_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
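The three column parameters classifier_label_index_col, prediction_index_col, and prediction_col describe a round trip: the target labels are encoded as numeric indices, the model predicts an index, and the index is mapped back to a label. A minimal sketch of that round trip follows. It assumes the encoding orders labels by descending frequency, which is the default of pyspark's StringIndexer; whether the operator uses that ordering is not stated in the source, and the function names are hypothetical.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to an index; more frequent labels get smaller indices."""
    counts = Counter(labels)
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(index) for index, label in enumerate(ordered)}


def index_to_label(mapping):
    """Invert the mapping so predicted indices can be turned back into labels."""
    return {index: label for label, index in mapping.items()}


label_col = ["no", "yes", "no", "no", "maybe", "yes"]
mapping = fit_label_index(label_col)
label_index = [mapping[label] for label in label_col]  # what label_index would hold
decoded = [index_to_label(mapping)[i] for i in label_index]  # back to the original labels
```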

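inputs | dataframe | Required. The input dataset. If neither pipeline_model nor gbt_regressor_model is supplied, a gradient-boosted tree regression model is trained directly on the dataset to obtain the feature importances.
 | pipeline_model | Optional. If supplied, feature importance is computed from the upstream pyspark pipeline model object pipeline_model.

label_col | - | Name of the target column
model_input_features_co | - | Name of the feature vector column

Parameter | Sub-parameter | Description
min_instances_per_node | - | Minimum number of instances each node must contain when the decision tree splits; defaults to 1
min_info_gain | - | Minimum information gain; defaults to 0
subsampling_rate | - | Fraction of the training set sampled to train each tree; defaults to 1
max_iter | - | Maximum number of iterations; defaults to 20
step_size | - | Step size; defaults to 0.1

Sample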

inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_regressor_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
    "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "false", "options": "squared, absolute", "helpTip": ""}
    "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "false", "helpTip": ""}
    "impurity": "variance"
}
gbt_regression_feature_importance____id___ = MLSGBTRegressorFeatureImportance(**params)
gbt_regression_feature_importance____id___.run()
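To show how max_iter and step_size interact, here is a deliberately small gradient boosting sketch with squared loss. It is not the Spark implementation behind the operator: it uses a single feature, depth-1 trees (stumps), and no subsampling, and the names fit_stump and fit_gbt are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, max_iter=20, step_size=0.1):
    """Each iteration fits a stump to the residuals and adds a damped correction."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(max_iter):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + step_size * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + step_size * sum(stump(x) for stump in stumps)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
model = fit_gbt(xs, ys, max_iter=20, step_size=0.1)
one_step = fit_gbt(xs, ys, max_iter=1, step_size=0.1)
```

On this data every iteration removes a tenth of the remaining residual, so after 20 iterations about 12% of the initial error is left. A smaller step_size therefore needs a larger max_iter to reach the same fit.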

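Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

Output

A dataset.

Parameter description

Parameter | Sub-parameter | Description
select_columns_str | - | Formatted string of column names, for example: "column_a" or "column_a,column_b"
n_estimators | - | Number of base learners; defaults to 100
max_samples | - | Number of samples drawn from the dataset for training; accepts "auto", an int, or a float
contamination | - | -
max_features | - | Number of features drawn from the dataset to train each base learner
bootstrap | - | Whether samples are drawn with replacement when building trees: True means with replacement, False means without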
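The table says that max_samples accepts "auto", an int, or a float, but not how each form is interpreted. The parameter names resemble those of scikit-learn's IsolationForest, so the sketch below assumes the same convention: "auto" means at most 256 rows, an int is an absolute count, and a float is a fraction of the dataset. This mapping is an assumption, not something the source confirms, and resolve_max_samples is a hypothetical helper.

```python
def resolve_max_samples(max_samples, n_rows):
    """Translate a max_samples setting into a concrete number of rows."""
    if max_samples == "auto":
        # Assumed convention: cap the per-learner sample at 256 rows.
        return min(256, n_rows)
    if isinstance(max_samples, bool):
        raise TypeError("max_samples must be 'auto', an int, or a float")
    if isinstance(max_samples, int):
        # An absolute count cannot exceed the size of the dataset.
        return min(max_samples, n_rows)
    if isinstance(max_samples, float):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("a float max_samples must lie in (0, 1]")
        return int(max_samples * n_rows)
    raise TypeError("max_samples must be 'auto', an int, or a float")


auto_large = resolve_max_samples("auto", 1000)
auto_small = resolve_max_samples("auto", 100)
absolute = resolve_max_samples(50, 1000)
fraction = resolve_max_samples(0.5, 1000)
```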

Sample

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
    "n_estimators": 100,  # @param {"label":"n_estimators","type":"integer","required":"false","helpTip":""}
    "max_samples": "auto",  # @param {"label":"max_samples","type":"string","required":"false","helpTip":""}
    "contamination": "auto",  # @param

7.5.1.1.9 Percentile Statistics

Overview

Computes percentile statistics for the numeric columns selected by the user.

Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
}
percentile_statistics____id___ = MLSPercentileStatistics(**params)
percentile_statistics____id___.run()
# @output {"label":"dataframe","name":"percentile_statistics____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
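As a reminder of what a percentile statistic is, the sketch below computes one with linear interpolation between neighbouring sorted values. The source does not say which percentiles the operator reports or which method it uses (Spark often relies on approximate quantiles), so both the interpolation rule and the function name are assumptions.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between neighbours."""
    ordered = sorted(values)
    # Fractional position of the percentile within the sorted values.
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


column = [5, 1, 3, 2, 4]
median = percentile(column, 50)
first_quartile = percentile(column, 25)
ninetieth = percentile(column, 90)
```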

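7.5.1.1.10 Histogram

Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

Output

Parameter description

Parameter | Sub-parameter | Description
select_column_name | - | Name of the selected column
string_bucket_show_
numerical_interval | - | If the selected column is numeric, the interval length used for the feature values

Sample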

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "string_bucket_show_num": 10,  # @param {"label":"string_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_bucket_show_num": 10,  # @param {"label":"numerical_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_interval": 0.05  # @param {"label":"numerical_interval","type":"float","required":"true","helpTip":""}
}
plot_bar_chart____id___ = MLSPlotBarChart(**params)
plot_bar_chart____id___.run()
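One plausible reading of these parameters is sketched below: numeric values are grouped into intervals of length numerical_interval, string values are counted per distinct value, and the two show_num parameters limit how many buckets are displayed. The descriptions of string_bucket_show_num and numerical_bucket_show_num are missing from the source, so their meaning here is inferred from the names alone, and the helper functions are hypothetical.

```python
import math
from collections import Counter


def numeric_buckets(values, interval, show_num):
    """Count values per interval of fixed length and keep the first show_num buckets."""
    counts = Counter(math.floor(v / interval) for v in values)
    buckets = sorted(counts.items())[:show_num]
    # Report each bucket as ((lower bound, upper bound), count).
    return [((k * interval, (k + 1) * interval), c) for k, c in buckets]


def string_buckets(values, show_num):
    """Count distinct strings and keep the show_num most frequent ones."""
    return Counter(values).most_common(show_num)


numbers = [0.1, 0.4, 0.6, 1.2, 1.3, 1.4]
words = ["a", "b", "a", "c", "a", "b"]
numeric_result = numeric_buckets(numbers, interval=0.5, show_num=10)
string_result = string_buckets(words, show_num=2)
```

Values that sit very close to a bucket boundary can land in the neighbouring bucket because of floating-point division, which matters for small intervals such as the sample's 0.05.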

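7.5.1.1.11 Line Chart

Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

Output

Parameter description

Parameter | Sub-parameter | Description
select_columns_str | - | Formatted string of column names, for example: "column_a" or "column_a,column_b"
start_index | - | Start index into the array converted from the dataset when plotting the line chart
end_index | - | End index into the array converted from the dataset when plotting the line chart
figure_length | - | Length of the figure
figure_width | - | Width of the figure

Sample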

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
    "start_index": 0,  # @param {"label":"start_index","type":"integer","required":"true","helpTip":""}
    "end_index": 0,  # @param {"label":"end_index","type":"integer","required":"true","helpTip":""}
    "figure_length": 30,  # @param {"label":"figure_length","type":"integer","required":"false","helpTip":""}
    "figure_width": 10  # @param {"label":"figure_width","type":"integer","required":"false","helpTip":""}
}
plot_line____id___ = MLSPlotLine(**params)
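The sketch below illustrates how select_columns_str, start_index, and end_index could select the data to be drawn. It treats end_index as exclusive, in the manner of a Python slice, which the source does not specify; the helper names are hypothetical and a list of dictionaries stands in for the DataFrame.

```python
def parse_columns(select_columns_str):
    """Split a formatted string such as "column_a,column_b" into column names."""
    return [name.strip() for name in select_columns_str.split(",") if name.strip()]


def series_for_plot(rows, select_columns_str, start_index, end_index):
    """Return one list of values per selected column for rows[start_index:end_index]."""
    window = rows[start_index:end_index]  # assumption: end_index is exclusive
    return {name: [row[name] for row in window] for name in parse_columns(select_columns_str)}


rows = [{"a": i, "b": i * i} for i in range(5)]
series = series_for_plot(rows, "a, b", start_index=1, end_index=4)
```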

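Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

al_length | - | If numeric_intervals_str is not set, every interval of the pie chart has the same length by default; numeric_interval_length is that interval length
show_share_number | - | Number of shares in the pie chart; defaults to 5
figure_length | - | Length of the figure
figure_width | - | Width of the figure

Sample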

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "numeric_intervals_str": "",  # @param {"label":"numeric_intervals_str","type":"string","required":"false","helpTip":""}
    "numeric_interval_length": "",  # @param {"label":"numeric_interval_length","type":"string","required":"false","helpTip":""}
    "show_share_number": 5,  # @param {"label":"show_share_number","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}

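7.5.1.1.13 Scatter Plot

Overview

Draws a scatter plot for the dataset.

Input

Parameter | Sub-parameter | Description
inputs | dataframe | inputs is a dictionary; dataframe is a pyspark DataFrame object

Output

Parameter description

Parameter | Sub-parameter | Description
start_index | - | The scatter plot is drawn only for the elements within a range of the array converted from the dataset; start_index is the start position
end_index | - | The scatter plot is drawn only for the elements within a range of the array converted from the dataset; end_index is the end position
x_axis_column_name | - | Name of the column used for the x axis
y_axis_columns_str | - | Columns used for the y axis, given as a string of column names separated by commas
figure_length | - | Length of the figure
figure_width | - | Width of the figure

Sample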

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
# ...
    # {"label":"figure_width","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
}
plot_scatter____id___ = MLSPlotScatter(**params)
plot_scatter____id___.run()

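7.5.1.1.14 Random Forest Classification Feature Importance

Overview

Uses the random forest classification algorithm to compute the importance of each feature in the dataset.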
