6.6 Configuring a Local IDE (Connecting with an SSH Tool)
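7.4.3 Writing a Custom Operator
The custom operator feature lets users write their own operators.
Click the "Add Custom Operator" icon to create and open a template operator, that is, an operator editor (comparable to a single cell in an IPython Notebook). Enter a name for the custom operator, and then develop it in the new operator editor, as shown in Figure 7-44. The operator can then be edited, debugged, copied, and saved.
Figure 7-44 Adding a custom operator
Note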
class MLSClassName:
    # core code for customized algorithm
    def run(self):
        # get upper output of workflow
        self.upper_output = self.inputs["upper_output"]
        # ...core code...
        # output format
        self._outputs = {
            "output_port_1": "output_result"
        }

    # user called method for getting algorithm result
    def get_outputs(self):
        return self._outputs

# call form for algorithm
inputs = {
    "upper_output": None  #@input {"type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "param_1": "param_value_1",  #@param {"label":"param_1","type":"string","required":"false","helpTip":""}
    "param_2": "param_value_2",  #@param {"label":"param_1","type":"enum","options":"one,two,three","required":"true","helpTip":""}
    "param_3": "param_value_3",  #@param {"label":"param_1","type":"integer","range":"(0,none)","required":"true","helpTip":""}
    "param_4": "param_value_4"  #@param {"label":"param_1","type":"number","range":"(0,1)","required":"true","helpTip":""}
}
mls_instance_#id# = MLSClassName(**params)
mls_instance_#id#.run()
#@output {"label":"dataframe","name":"mls_instance_#id#.get_outputs() ['output_port_1']","type":"DataFrame"}
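As a rough illustration of the template's run/get_outputs shape, the sketch below fills in the parts the template leaves open. The class name `MLSSuffixColumns`, the `__init__` signature, the `suffix` parameter, the instance name `mls_instance_demo`, and the use of a plain list of row dictionaries in place of a pyspark DataFrame are all assumptions made so that the example runs on its own; the platform may construct and wire operators differently.

```python
class MLSSuffixColumns:
    """Hypothetical custom operator that follows the template's shape."""

    def __init__(self, inputs, suffix="_out"):
        # Assumption: the platform passes the params dictionary as keyword arguments.
        self.inputs = inputs
        self.suffix = suffix
        self._outputs = {}

    def run(self):
        # Get the upstream output of the workflow.
        upper_output = self.inputs["upper_output"]
        # Core logic: append the suffix to every column name.
        result = [
            {key + self.suffix: value for key, value in row.items()}
            for row in upper_output
        ]
        # Output format: one entry per output port.
        self._outputs = {"output_port_1": result}

    def get_outputs(self):
        # Called by the user (or a downstream operator) to fetch the result.
        return self._outputs


# Stand-in for an upstream DataFrame (assumption: a list of row dictionaries).
inputs = {"upper_output": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}
params = {"inputs": inputs, "suffix": "_out"}

mls_instance_demo = MLSSuffixColumns(**params)
mls_instance_demo.run()
print(mls_instance_demo.get_outputs()["output_port_1"])
```

The result is read back through `get_outputs()["output_port_1"]`, the same expression that the `#@output` marker in the template names.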
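Guidelines for Writing Input Settings
As the code template shows, the output of the upstream operator serves as the input of this operator. It is passed to the class object through the inputs dictionary, which is how data moves from the upstream operator to the current operator in the operator chain.
The #@input marker triggers a response in the front-end interface and defines the input port of this operator, which in turn lets it connect with the upstream...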
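To make the marker format concrete, here is a minimal sketch of how a tool could read such markers from a line of operator code. It is not the platform's actual parser: the regular expression, the function name `parse_marker`, and the assumption that the marker payload is always valid JSON on a single line are all hypothetical.

```python
import json
import re

# Hypothetical pattern: "#", optional space, a marker name, then a JSON object.
MARKER_RE = re.compile(r"#\s*@(input|param|output)\s*(\{.*\})\s*$")


def parse_marker(line):
    """Return (marker_kind, metadata_dict), or None if the line has no marker."""
    match = MARKER_RE.search(line)
    if match is None:
        return None
    kind, payload = match.groups()
    return kind, json.loads(payload)


line = '"upper_output": None #@input {"type":"DataFrame"}'
print(parse_marker(line))
```

The same function handles `#@param` and `#@output` lines, because all three markers share the comment-plus-JSON layout used in the template.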
Guidelines for Writing Output Settings
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
}
box_plot____id___ = MLSBoxPlot(**params)
box_plot____id___.run()
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
bucket_num    -    Number of buckets; defaults to 10
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
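7.5.1.1.3 Correlation Analysis
Overview
Performs correlation analysis on the numeric columns of a dataset.
Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A dataset containing the statistical results
Parameter Description
Parameter    Sub-parameter    Description
selected_columns_str    -    Formatted string of the selected column names; the columns must be numeric. For example: "column_a" or "column_a,column_b"
method    -    Correlation method; "pearson" and "spearman" are supported
Sample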
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "selected_columns_str": "",  # @param {"label":"selected_columns_str","type":"string","required":"false","helpTip":""}
    "method": "pearson"  # @param {"label":"method","type":"enum","required":"true","options":"pearson,spearman","helpTip":""}
}
correlation_analysis____id___ = MLSCorrelationAnalysis(**params)
correlation_analysis____id___.run()
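The difference between the two supported methods can be seen in a small pure-Python sketch. It is an illustration of the standard definitions, not the operator's implementation (which works on pyspark DataFrames); the helper names and the sample columns are made up for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


column_a = [1.0, 2.0, 3.0, 4.0, 5.0]
column_b = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic in column_a, but not linear
print(round(pearson(column_a, column_b), 4))
print(round(spearman(column_a, column_b), 4))
```

Because column_b increases whenever column_a does, the Spearman value is exactly 1, while the Pearson value falls slightly below 1 because the relationship is not linear.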
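Input
Parameter    Sub-parameter    Description
inputs    dataframe    Required. The input dataset. If neither pipeline_model nor decision_tree_classify_model is provided, a decision tree classifier is trained directly on the dataset to obtain the feature importance.
    pipeline_model    Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object.
    decision_tree_classify_model    Optional. If provided, the feature importance is computed from the upstream decision tree classification model object.
Output
A result dataset containing the feature importance
Parameter Description
Parameter    Sub-parameter    Description
input_columns_str    -    Formatted string of the dataset's feature column names. For example: "column_a" or "column_a,column_b"
label_col    -    Name of the target column
model_input_featu
prediction_col    -    Name of the column that holds the prediction result when the model is trained; defaults to "prediction"
impurity    -    Criterion used to compute information gain; "gini" and "entropy" are supported
Sample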
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "decision_tree_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label":"input_columns_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "model_input_features_col": "model_features",  # @param {"label":"model_input_features_col","type":"string","required":"false","helpTip":""}
    "classifier_label_index_col": "label_index",  # @param {"label":"classifier_label_index_col","type":"string","required":"false","helpTip":""}
    "prediction_index_col": "prediction_index",  # @param {"label":"prediction_index_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "gini"  # @param {"label":"impurity","type":"enum","required":"false","options":"entropy,gini","helpTip":""}
}
dt_classify_feature_importance____id___ = MLSDecisionTreeClassifierFeatureImportance(**params)
dt_classify_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_classify_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
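To show what the impurity parameter chooses between, the following sketch computes the information gain of one candidate split under both criteria. It uses the textbook definitions on a plain list of labels; the helper names are made up for the example, and using base-2 logarithms for entropy is an assumption rather than something this document states.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions (base 2, by assumption)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return (
        impurity(parent)
        - weight_left * impurity(left)
        - weight_right * impurity(right)
    )


parent = ["yes", "yes", "no", "no"]
left, right = ["yes", "yes"], ["no", "no"]  # a split that separates the classes
print(information_gain(parent, left, right, gini))     # 0.5
print(information_gain(parent, left, right, entropy))  # 1.0
```

Both criteria reward the same perfect split here; they differ in scale and in how strongly they penalize mixed nodes.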
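Parameter    Sub-parameter    Description
    pipeline_model    Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object.
    decision_tree_r
input_columns_str    -    Formatted string of the dataset's feature column names. For example: "column_a" or "column_a,column_b"
label_col    -    Name of the target column
model_input_features_col    -    Name of the feature vector column
prediction_col    -    Name of the column that holds the prediction result when the model is trained; defaults to "prediction"
max_depth    -    Maximum depth of the tree; defaults to 5
max_bins    -    Maximum number of bins used when splitting features; defaults to 32
min_instances_per_node    -    Minimum number of instances that each node must contain when the decision tree splits; defaults to 1
min_info_gain    -    Minimum information gain; defaults to 0.0
Sample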
inputs = {
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "variance"
}
dt_regression_feature_importance____id___ = MLSDecisionTreeRegressorFeatureImportance(**params)
dt_regression_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_regression_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
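For regression trees the impurity is "variance", and min_info_gain and min_instances_per_node limit which splits are accepted. The sketch below illustrates that idea on plain lists. It is not the operator's code: the helper names are hypothetical, and accepting a split whose gain is at least min_info_gain is an assumption about the exact comparison.

```python
def variance(values):
    """Population variance of a list of target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_gain(parent, left, right):
    """Variance reduction: parent variance minus size-weighted child variance."""
    weight_left = len(left) / len(parent)
    weight_right = len(right) / len(parent)
    return variance(parent) - weight_left * variance(left) - weight_right * variance(right)


def split_allowed(parent, left, right, min_info_gain=0.0, min_instances_per_node=1):
    """Apply the two thresholds to one candidate split (comparison rule assumed)."""
    if min(len(left), len(right)) < min_instances_per_node:
        return False
    return variance_gain(parent, left, right) >= min_info_gain


parent = [1.0, 2.0, 9.0, 10.0]
left, right = [1.0, 2.0], [9.0, 10.0]
print(variance_gain(parent, left, right))  # 16.0
print(split_allowed(parent, left, right, min_info_gain=0.5, min_instances_per_node=2))
print(split_allowed(parent, left, right, min_instances_per_node=3))
```

Raising either threshold makes the tree more conservative: fewer splits qualify, so the tree stays shallower.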
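inputs    dataframe    Required. The input dataset. If neither pipeline_model nor gbt_classify_model is provided, a GBDT classification model is trained directly on the dataset to obtain the feature importance.
    pipeline_model    Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
Parameter    Sub-parameter    Description
classifier_label_index_col    -    Name of the column that holds the label-encoded target column; defaults to "label_index"
prediction_index_col    -    Name of the column that holds the label index of the prediction when the model is trained; defaults to "prediction_index"
prediction_col    -    Name of the column that holds the prediction result when the model is trained; defaults to "prediction"
max_depth    -    Maximum depth of the tree
max_bins    -    Maximum number of bins used when splitting features
min_instances_per_node    -    Minimum number of instances that each node must contain when the tree splits; defaults to 1
min_info_gain    -    Minimum information gain
max_iter    -    Maximum number of iterations
step_size    -    Step size
subsampling_rate    -    Fraction of the training set sampled when training each tree
Sample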
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false",
    "subsampling_rate": 1.0  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
}
gbt_classifier_feature_importance____id___ = MLSGBTClassifierFeatureImportance(**params)
gbt_classifier_feature_importance____id___.run()
# @output {"label":"dataframe","name":"gbt_classifier_feature_importance____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
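inputs    dataframe    Required. The input dataset. If neither pipeline_model nor gbt_regressor_model is provided, a gradient-boosted tree regression model is trained directly on the dataset to obtain the feature importance.
    pipeline_model    Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
label_col    -    Name of the target column
model_input_features_col    -    Name of the feature vector column
Parameter    Sub-parameter    Description
min_instances_per_node    -    Minimum number of instances that each node must contain when the decision tree splits; defaults to 1
min_info_gain    -    Minimum information gain; defaults to 0
subsampling_rate    -    Fraction of the training set sampled when training each tree; defaults to 1
max_iter    -    Maximum number of iterations; defaults to 20
step_size    -    Step size; defaults to 0.1
Sample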
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_regressor_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
    "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "false", "options": "squared, absolute", "helpTip": ""}
    "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "false", "helpTip": ""}
    "impurity": "variance"
}
gbt_regression_feature_importance____id___ = MLSGBTRegressorFeatureImportance(**params)
gbt_regression_feature_importance____id___.run()
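Several @param markers in these samples carry a "range" string in interval notation, such as "(0,2147483647]". The sketch below shows one way such a string could be checked against a value before an operator runs. The function names are hypothetical, and reading "none" as "unbounded" is an assumption based on the "(0,none)" range in the custom operator template.

```python
import math


def parse_range(text):
    """Parse an interval such as "(0,1)" or "(0,2147483647]" into bounds and flags."""
    text = text.strip()
    low_closed = text[0] == "["    # "[" includes the lower bound, "(" excludes it
    high_closed = text[-1] == "]"  # "]" includes the upper bound, ")" excludes it
    low_text, high_text = (part.strip().lower() for part in text[1:-1].split(","))
    # Assumption: "none" means the bound is absent.
    low = -math.inf if low_text == "none" else float(low_text)
    high = math.inf if high_text == "none" else float(high_text)
    return low, high, low_closed, high_closed


def in_range(value, text):
    """Return True if value lies inside the interval described by text."""
    low, high, low_closed, high_closed = parse_range(text)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below


print(in_range(5, "(0,2147483647]"))  # True: max_depth=5 is valid
print(in_range(0, "(0,2147483647]"))  # False: the lower bound is exclusive
print(in_range(0, "[0,2147483647]"))  # True: the lower bound is inclusive
```

With this reading, max_depth, max_bins, min_instances_per_node, and max_iter must all be positive integers no larger than 2147483647.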
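Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A dataset
Parameter Description
Parameter    Sub-parameter    Description
select_columns_str    -    Formatted string of column names. For example: "column_a" or "column_a,column_b"
n_estimators    -    Number of base learners; defaults to 100
max_samples    -    Number of samples drawn from the dataset for training; "auto", an int, or a float is accepted
contamination    -    -
max_features    -    Number of features drawn from the dataset to train each base learner
bootstrap    -    Whether samples are drawn with replacement when trees are built; True means with replacement, False means without
Sample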
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
    "n_estimators": 100,  # @param {"label":"n_estimators","type":"integer","required":"false","helpTip":""}
    "max_samples": "auto",  # @param {"label":"max_samples","type":"string","required":"false","helpTip":""}
    "contamination": "auto",  # @param
7.5.1.1.9 Percentile Statistics
Overview
Computes percentile statistics for the numeric columns that the user selects.
Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
}
percentile_statistics____id___ = MLSPercentileStatistics(**params)
percentile_statistics____id___.run()
# @output {"label":"dataframe","name":"percentile_statistics____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
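As a small illustration of what a percentile statistic is, the sketch below computes a few percentiles of one numeric column in plain Python. This document does not say which percentiles the operator reports or which interpolation rule it uses, so the linear-interpolation rule, the function name, and the sample data are assumptions.

```python
def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


column_a = [15, 20, 35, 40, 50]
for q in (25, 50, 75):
    print(q, percentile(column_a, q))
```

With five sorted values, the 25th, 50th, and 75th percentiles fall exactly on the second, third, and fourth values.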
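7.5.1.1.10 Histogram
Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
None
Parameter Description
Parameter    Sub-parameter    Description
select_column_name    -    Name of the selected column
string_bucket_show_
numerical_interval    -    If the selected column is numeric, the interval length used to group the feature values
Sample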
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "string_bucket_show_num": 10,  # @param {"label":"string_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_bucket_show_num": 10,  # @param {"label":"numerical_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_interval": 0.05  # @param {"label":"numerical_interval","type":"float","required":"true","helpTip":""}
}
plot_bar_chart____id___ = MLSPlotBarChart(**params)
plot_bar_chart____id___.run()
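The role of numerical_interval can be pictured as fixed-width bucketing of a numeric column. The sketch below counts values per bucket in plain Python. It is only an illustration: the function name is made up, and anchoring the buckets at zero is an assumption, since this document does not describe how the operator places bucket boundaries.

```python
import math
from collections import Counter


def bucket_counts(values, interval):
    """Count values per fixed-width bucket [k*interval, (k+1)*interval)."""
    counts = Counter(math.floor(v / interval) for v in values)
    return {
        (k * interval, (k + 1) * interval): counts[k]
        for k in sorted(counts)
    }


values = [0.5, 1.5, 1.75, 2.25, 4.0]
for (low, high), count in bucket_counts(values, 1.0).items():
    print(f"[{low}, {high}): {count}")
```

A smaller interval gives more, narrower bars; a larger one merges neighboring values into fewer bars.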
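7.5.1.1.11 Line Chart
Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
None
Parameter Description
Parameter    Sub-parameter    Description
select_columns_str    -    Formatted string of column names. For example: "column_a" or "column_a,column_b"
start_index    -    Start index into the array converted from the dataset when the line chart is drawn
end_index    -    End index into the array converted from the dataset when the line chart is drawn
figure_length    -    Length of the figure
figure_width    -    Width of the figure
Sample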
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
    "start_index": 0,  # @param {"label":"start_index","type":"integer","required":"true","helpTip":""}
    "end_index": 0,  # @param {"label":"end_index","type":"integer","required":"true","helpTip":""}
    "figure_length": 30,  # @param {"label":"figure_length","type":"integer","required":"false","helpTip":""}
    "figure_width": 10  # @param {"label":"figure_width","type":"integer","required":"false","helpTip":""}
}
plot_line____id___ = MLSPlotLine(**params)
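Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
numeric_interval_length    -    If numeric_intervals_str is not set, every interval of the pie chart has the same length by default, and numeric_interval_length specifies that length
show_share_number    -    Number of slices in the pie chart; defaults to 5
figure_length    -    Length of the figure
figure_width    -    Width of the figure
Sample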
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "numeric_intervals_str": "",  # @param {"label":"numeric_intervals_str","type":"string","required":"false","helpTip":""}
    "numeric_interval_length": "",  # @param {"label":"numeric_interval_length","type":"string","required":"false","helpTip":""}
    "show_share_number": 5,  # @param {"label":"show_share_number","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
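7.5.1.1.13 Scatter Plot
Overview
Draws a scatter plot of the dataset.
Input
Parameter    Sub-parameter    Description
inputs    dataframe    inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
None
Parameter Description
Parameter    Sub-parameter    Description
start_index    -    Only the elements within a given range of the array converted from the dataset are plotted; start_index is the start position of that range
end_index    -    Only the elements within a given range of the array converted from the dataset are plotted; end_index is the end position of that range
x_axis_column_name    -    Name of the column used for the x-axis
y_axis_columns_str    -    Columns used for the y-axis, given as a string of column names separated by commas
figure_length    -    Length of the figure
figure_width    -    Width of the figure
Sample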
inputs = {
"dataframe": None # @input {"label":"dataframe","type":"DataFrame"}
}
{"label":"figure_width","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
}
plot_scatter____id___ = MLSPlotScatter(**params)
plot_scatter____id___.run()
7.5.1.1.14 Random Forest Classifier Feature Importance
Overview
Computes the importance of the dataset's features using a random forest classification algorithm.