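7.5 Preset Operator Descriptions
7.5.1 Data Features
7.5.1.1 Data Analysis
7.5.1.1.3 Correlation Analysis
Overview
Performs correlation analysis on the numeric columns of a dataset.
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
Output
A dataset containing the statistical results.
Parameter Description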
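Parameter Sub-parameter Description
selected_columns_str - Formatted string of the selected column names. The columns must be numeric. For example:
"column_a"
"column_a,column_b"
method - Correlation method to use. Supported values: "pearson" and "spearman".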
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "selected_columns_str": "",  # @param {"label":"selected_columns_str","type":"string","required":"false","helpTip":""}
    "method": "pearson"  # @param {"label":"method","type":"enum","required":"true","options":"pearson,spearman","helpTip":""}
}
correlation_analysis____id___ = MLSCorrelationAnalysis(**params)
correlation_analysis____id___.run()
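The two values accepted by method can be illustrated without Spark. The sketch below is not the operator's implementation: it uses only the Python standard library, the helper names (parse_columns, pearson, ranks, spearman) are hypothetical, and it assumes that selected_columns_str is split on commas.

```python
import math


def parse_columns(columns_str):
    """Split a string such as "column_a,column_b" into a list of column names."""
    return [name.strip() for name in columns_str.split(",") if name.strip()]


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

For a monotonic but non-linear relationship such as y = x squared, spearman returns 1 while pearson stays below 1, which is the practical difference between the two methods.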
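Input
Parameter Sub-parameter Description
inputs dataframe Required. The input dataset. If neither pipeline_model nor decision_tree_classify_model is provided, a decision tree classifier is trained directly on the dataset to obtain the feature importance.
pipeline_model Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object.
decision_tree_classify_model Optional. If provided, the feature importance is computed from the upstream decision tree classification model object.
Output
A result dataset containing the feature importance.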
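The input rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the operator's code: the function name feature_importance_source is hypothetical, and the precedence between pipeline_model and decision_tree_classify_model when both are supplied is an assumption, since this section does not state it.

```python
def feature_importance_source(inputs):
    """Return the key of `inputs` that would drive the feature-importance computation."""
    if inputs.get("dataframe") is None:
        raise ValueError("dataframe is required")
    # Assumed precedence: an upstream pipeline model is checked before a standalone model.
    for key in ("pipeline_model", "decision_tree_classify_model"):
        if inputs.get(key) is not None:
            return key
    # No upstream model was supplied: a decision tree is trained on the dataset itself.
    return "dataframe"
```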
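Parameter Description
Parameter Sub-parameter Description
input_columns_str - Formatted string of the dataset's feature column names, for example:
"column_a"
"column_a,column_b"
label_col - Name of the target column.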
model_input_featu
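prediction_col - Name of the column that holds the prediction results when the model is trained. Defaults to "prediction".
impurity - Criterion used to compute information gain. Supported values: "gini" and "entropy".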
Sample
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "decision_tree_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label":"input_columns_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "model_input_features_col": "model_features",  # @param {"label":"model_input_features_col","type":"string","required":"false","helpTip":""}
    "classifier_label_index_col": "label_index",  # @param {"label":"classifier_label_index_col","type":"string","required":"false","helpTip":""}
    "prediction_index_col": "prediction_index",  # @param {"label":"prediction_index_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "gini"  # @param {"label":"impurity","type":"enum","required":"false","options":"entropy,gini","helpTip":""}
}
dt_classify_feature_importance____id___ = MLSDecisionTreeClassifierFeatureImportance(**params)
dt_classify_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_classify_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
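The two impurity options of this operator, "gini" and "entropy", are standard measures of how mixed the class labels in a tree node are. The sketch below shows the textbook definitions using only the standard library; the function names are hypothetical, and the operator's internal computation is not shown in this section.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())
```

Both measures are 0 for a node with a single class and reach their maximum when the classes are evenly mixed.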
Parameter Sub-parameter Description
pipeline_model Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object.
decision_tree_r
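input_columns_str - Formatted string of the dataset's feature column names, for example:
"column_a"
"column_a,column_b"
label_col - Name of the target column.
model_input_features_col - Name of the feature vector column.
prediction_col - Name of the column that holds the prediction results when the model is trained. Defaults to "prediction".
max_depth - Maximum depth of the tree. Defaults to 5.
max_bins - Maximum number of bins used when splitting features. Defaults to 32.
min_instances_per_node - Minimum number of instances that each node must contain when the decision tree splits. Defaults to 1.
min_info_gain - Minimum information gain. Defaults to 0.0.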
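To make min_instances_per_node and min_info_gain concrete, the sketch below evaluates one candidate split of a regression node using variance as the impurity, which is the value the sample for this operator passes as impurity. It is an illustration under stated assumptions, not the operator's code: the function names are hypothetical, and treating a gain equal to min_info_gain as acceptable is an assumption.

```python
def variance(values):
    """Population variance of the target values in a node."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def split_gain(parent, left, right):
    """Information gain: parent impurity minus the size-weighted child impurities."""
    n = len(parent)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted


def split_allowed(parent, left, right, min_instances_per_node=1, min_info_gain=0.0):
    """A split is kept only if both children are large enough and the gain is sufficient."""
    if min(len(left), len(right)) < min_instances_per_node:
        return False
    return split_gain(parent, left, right) >= min_info_gain
```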
Sample
inputs = {
    "max_depth": 5,  # @param {"label":"max_depth","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bins": 32,  # @param {"label":"max_bins","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_instances_per_node": 1,  # @param {"label":"min_instances_per_node","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "min_info_gain": 0.0,  # @param {"label":"min_info_gain","type":"number","required":"false","helpTip":""}
    "impurity": "variance"
}
dt_regression_feature_importance____id___ = MLSDecisionTreeRegressorFeatureImportance(**params)
dt_regression_feature_importance____id___.run()
# @output {"label":"dataframe","name":"dt_regression_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
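inputs dataframe Required. The input dataset. If neither pipeline_model nor gbt_classify_model is provided, a GBDT classification model is trained directly on the dataset to obtain the feature importance.
pipeline_model Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object pipeline_model.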
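Parameter Sub-parameter Description
classifier_label_index_col - Name of the column that holds the label-encoded target column. Defaults to "label_index".
prediction_index_col - Name of the column that holds the label index of the prediction results when the model is trained. Defaults to "prediction_index".
prediction_col - Name of the column that holds the prediction results when the model is trained. Defaults to "prediction".
max_depth - Maximum depth of the tree.
max_bins - Maximum number of bins used when splitting features.
min_instances_per_node - Minimum number of instances that each node must contain when the tree splits. Defaults to 1.
min_info_gain - Minimum information gain.
max_iter - Maximum number of iterations.
step_size - Step size.
subsampling_rate - Fraction of the training set sampled to train each tree.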
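The roles of max_iter and step_size can be seen in a deliberately simplified boosting loop. This toy is not a gradient-boosted tree and not the operator's implementation: every weak learner here just predicts the mean residual, and the function name boost_constant is hypothetical. It only shows that each iteration adds a new learner whose contribution is scaled by step_size.

```python
def boost_constant(targets, max_iter=20, step_size=0.1):
    """Toy boosting loop in which each weak learner predicts the mean residual."""
    prediction = 0.0
    for _ in range(max_iter):
        residuals = [y - prediction for y in targets]
        weak_learner_output = sum(residuals) / len(residuals)
        # step_size shrinks the contribution of each newly added learner.
        prediction += step_size * weak_learner_output
    return prediction
```

A smaller step_size makes each iteration more conservative, so more iterations are needed to approach the targets; a step_size of 1.0 reaches the mean of the targets in a single iteration of this toy.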
Sample
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false",
    "subsampling_rate": 1.0  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
}
gbt_classifier_feature_importance____id___ = MLSGBTClassifierFeatureImportance(**params)
gbt_classifier_feature_importance____id___.run()
# @output {"label":"dataframe","name":"gbt_classifier_feature_importance____id___.get_outputs()['output_port_1']","type":"DataFrame"}
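inputs dataframe Required. The input dataset. If neither pipeline_model nor gbt_regressor_model is provided, a gradient-boosted tree regression model is trained directly on the dataset to obtain the feature importance.
pipeline_model Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object pipeline_model.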
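Parameter Sub-parameter Description
label_col - Name of the target column.
model_input_features_col - Name of the feature vector column.
min_instances_per_node - Minimum number of instances that each node must contain when the decision tree splits. Defaults to 1.
min_info_gain - Minimum information gain. Defaults to 0.
subsampling_rate - Fraction of the training set sampled to train each tree. Defaults to 1.
max_iter - Maximum number of iterations. Defaults to 20.
step_size - Step size. Defaults to 0.1.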
Sample
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "gbt_regressor_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "false", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "false", "helpTip": ""}
    "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "false", "options": "squared, absolute", "helpTip": ""}
    "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "false", "helpTip": ""}
    "impurity": "variance"
}
gbt_regression_feature_importance____id___ = MLSGBTRegressorFeatureImportance(**params)
gbt_regression_feature_importance____id___.run()
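The sample above exposes a loss_type option with the values "squared" and "absolute". As a hedged illustration of what these names conventionally mean in gradient boosting, the sketch below gives the textbook loss functions and the negative gradients that each new tree would be fitted to. The function names are hypothetical and the operator's exact formulas are not given in this section.

```python
def squared_loss(label, prediction):
    """Squared error for one example."""
    return (label - prediction) ** 2


def squared_negative_gradient(label, prediction):
    """Negative gradient of the squared error with respect to the prediction."""
    return 2.0 * (label - prediction)


def absolute_loss(label, prediction):
    """Absolute error for one example."""
    return abs(label - prediction)


def absolute_negative_gradient(label, prediction):
    """Sign of the residual: +1, 0 or -1 (0 is one valid subgradient at a zero residual)."""
    diff = label - prediction
    return (diff > 0) - (diff < 0)
```

With the absolute loss every example pulls the model by the same amount regardless of how large its error is, which is why it is usually considered less sensitive to outliers than the squared loss.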
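Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
Output
A dataset.
Parameter Description
Parameter Sub-parameter Description
select_columns_str - Formatted string of column names, for example:
"column_a"
"column_a,column_b"
n_estimators - Number of base learners. Defaults to 100.
max_samples - Number of samples drawn from the dataset for training. Supports "auto", an int value, or a float value.
contamination - -
max_features - Number of features drawn from the dataset to train each base learner.
bootstrap - Whether samples are drawn with replacement when building the trees. True means with replacement; False means without replacement.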
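The table says that max_samples accepts "auto", an int, or a float, but not how each form is turned into a sample count. The sketch below shows one plausible interpretation, assuming the semantics of scikit-learn's IsolationForest, whose parameter names this operator's parameters resemble. That correspondence is an assumption, and the function name resolve_max_samples is hypothetical.

```python
def resolve_max_samples(max_samples, n_rows):
    """Translate max_samples into a number of rows, assuming scikit-learn-like rules."""
    if max_samples == "auto":
        # Assumed rule: use at most 256 rows.
        return min(256, n_rows)
    if isinstance(max_samples, float):
        if not 0.0 < max_samples <= 1.0:
            raise ValueError("a float max_samples must be in (0, 1]")
        # A float is read as a fraction of the dataset.
        return int(max_samples * n_rows)
    if isinstance(max_samples, int):
        # An int is read as an absolute count, capped at the dataset size.
        return min(max_samples, n_rows)
    raise TypeError("max_samples must be 'auto', an int, or a float")
```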
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
    "n_estimators": 100,  # @param {"label":"n_estimators","type":"integer","required":"false","helpTip":""}
    "max_samples": "auto",  # @param {"label":"max_samples","type":"string","required":"false","helpTip":""}
    "contamination": "auto",  # @param
7.5.1.1.9 Percentile Statistics
Overview
Computes percentile statistics for the numeric columns selected by the user.
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": ""  # @param {"label":"select_columns_str","type":"string","required":"false","helpTip":""}
}
percentile_statistics____id___ = MLSPercentileStatistics(**params)
percentile_statistics____id___.run()
# @output {"label":"dataframe","name":"percentile_statistics____id___.get_outputs()['output_port_1']","type":"DataFrame"}
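This section does not state which percentiles the operator reports or how it computes them. As a plain illustration of a percentile statistic on one numeric column, the sketch below uses linear interpolation between the two nearest ranks; the function name percentile and the interpolation rule are assumptions, not the operator's documented behavior.

```python
def percentile(values, q):
    """Return the q-th percentile (q in [0, 100]) using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction
```

A call such as percentile(column_values, 50) gives the median of the column under this rule.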
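7.5.1.1.10 Histogram
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
Output
None
Parameter Description
Parameter Sub-parameter Description
select_column_name - Name of the selected column.
string_bucket_show_
numerical_interval - If the selected column is numeric, the length of each interval of feature values.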
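For a numeric column, numerical_interval is the width of each histogram bucket. The sketch below shows one way such bucketing can be done in plain Python. It is an assumption-laden illustration: the function name bucket_counts is hypothetical, and anchoring the buckets at multiples of the interval (rather than at the column minimum) is a choice made here, not something this section specifies.

```python
import math
from collections import Counter


def bucket_counts(values, numerical_interval):
    """Count values per bucket; each key is the lower bound of a bucket of the given width."""
    counts = Counter(math.floor(v / numerical_interval) for v in values)
    return {index * numerical_interval: counts[index] for index in sorted(counts)}
```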
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "string_bucket_show_num": 10,  # @param {"label":"string_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_bucket_show_num": 10,  # @param {"label":"numerical_bucket_show_num","type":"integer","required":"true","helpTip":""}
    "numerical_interval": 0.05  # @param {"label":"numerical_interval","type":"float","required":"true","helpTip":""}
}
plot_bar_chart____id___ = MLSPlotBarChart(**params)
plot_bar_chart____id___.run()
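7.5.1.1.11 Line Chart
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
Output
None
Parameter Description
Parameter Sub-parameter Description
select_columns_str - Formatted string of column names, for example:
"column_a"
"column_a,column_b"
start_index - Start index into the array converted from the dataset when the line chart is drawn.
end_index - End index into the array converted from the dataset when the line chart is drawn.
figure_length - Length of the figure.
figure_width - Width of the figure.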
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_columns_str": "",  # @param {"label":"select_columns_str","type":"string","required":"true","helpTip":""}
    "start_index": 0,  # @param {"label":"start_index","type":"integer","required":"true","helpTip":""}
    "end_index": 0,  # @param {"label":"end_index","type":"integer","required":"true","helpTip":""}
    "figure_length": 30,  # @param {"label":"figure_length","type":"integer","required":"false","helpTip":""}
    "figure_width": 10  # @param {"label":"figure_width","type":"integer","required":"false","helpTip":""}
}
plot_line____id___ = MLSPlotLine(**params)
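Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
numeric_interval_length - If numeric_intervals_str is not set, all intervals of the pie chart have the same length by default, and numeric_interval_length specifies that length.
show_share_number - Number of shares (slices) in the pie chart. Defaults to 5.
figure_length - Length of the figure.
figure_width - Width of the figure.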
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "select_column_name": "",  # @param {"label":"select_column_name","type":"string","required":"true","helpTip":""}
    "numeric_intervals_str": "",  # @param {"label":"numeric_intervals_str","type":"string","required":"false","helpTip":""}
    "numeric_interval_length": "",  # @param {"label":"numeric_interval_length","type":"string","required":"false","helpTip":""}
    "show_share_number": 5,  # @param {"label":"show_share_number","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
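7.5.1.1.13 Scatter Plot
Overview
Draws the scatter plot for a dataset.
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object.
Output
None
Parameter Description
Parameter Sub-parameter Description
start_index - The scatter plot is drawn only for the elements within a range of the array converted from the dataset; start_index is the start position of that range.
end_index - The scatter plot is drawn only for the elements within a range of the array converted from the dataset; end_index is the end position of that range.
x_axis_column_name - Name of the column used for the x-axis of the scatter plot.
y_axis_columns_str - Columns used for the y-axis of the scatter plot, given as a string of column names separated by commas.
figure_length - Length of the figure.
figure_width - Width of the figure.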
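The way start_index, end_index, x_axis_column_name and y_axis_columns_str interact can be sketched on plain Python data. This is an illustration only: rows are modeled as a list of dictionaries instead of a pyspark DataFrame, the function name scatter_series is hypothetical, and treating end_index as exclusive (as in a Python slice) is an assumption because this section does not say whether the end position is included.

```python
def scatter_series(rows, x_axis_column_name, y_axis_columns_str, start_index, end_index):
    """Return, per y column, the (x values, y values) pair to plot for the selected range."""
    window = rows[start_index:end_index]  # assumption: end_index is exclusive
    y_columns = [name.strip() for name in y_axis_columns_str.split(",") if name.strip()]
    xs = [row[x_axis_column_name] for row in window]
    return {column: (xs, [row[column] for row in window]) for column in y_columns}
```

Each entry of the returned dictionary corresponds to one set of points that a plotting library could draw against the shared x values.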
Sample
inputs = {
"dataframe": None # @input {"label":"dataframe","type":"DataFrame"}
}
{"label":"figure_width","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
}
plot_scatter____id___ = MLSPlotScatter(**params)
plot_scatter____id___.run()
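7.5.1.1.14 Random Forest Classification Feature Importance
Overview
Computes the feature importance of a dataset using the random forest classification algorithm.
Input
Parameter Sub-parameter Description
inputs dataframe Required. The input dataset. If neither pipeline_model nor random_forest_classify_model is provided, a random forest classification model is trained directly on the dataset to obtain the feature importance.
pipeline_model Optional. If provided, the feature importance is computed from the upstream pyspark pipeline model object pipeline_model.
Parameter Sub-parameter Description
label_col - Name of the target column.
model_input_features_col - Name of the feature vector column.
prediction_col - Name of the column that holds the prediction results when the model is trained. Defaults to "prediction".
max_depth - Maximum depth of the tree. Defaults to 5.
max_bins - Maximum number of bins used when splitting features. Defaults to 32.
min_instances_per_node - Minimum number of instances that each node must contain when the tree splits. Defaults to 1.
min_info_gain - Minimum information gain. Defaults to 0.0.
impurity - Impurity measure. Supported values: "gini" and "entropy". Defaults to "gini".
num_trees - Number of trees. Defaults to 20.
feature_subset_strategy - Number of features considered for the split at each tree node. Defaults to "all".
subsampling_rate - Fraction of the training set sampled to train each tree. Defaults to 1.0.
seed - Random seed. Defaults to 0.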
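The table gives "all" as the default for feature_subset_strategy but does not list other values. The sketch below shows how such a strategy name could map to a number of features per split, assuming the named strategies of Spark MLlib's featureSubsetStrategy ("all", "sqrt", "log2", "onethird"). Whether this operator accepts values other than "all" is not stated here, and the function name features_per_split is hypothetical.

```python
import math


def features_per_split(strategy, num_features):
    """Number of candidate features per node split for a given strategy name."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unsupported strategy: " + strategy)
```

Using fewer features per split makes the individual trees less correlated with each other, at the cost of each tree being weaker on its own.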
Sample
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None,  # @input {"label":"pipeline_model","type":"PipelineModel"}
    "random_forest_classify_model": None
}
params = {
    "inputs": inputs,
    "input_columns_str": "",  # @param {"label": "input_columns_str", "type": "string", "required": "false", "helpTip": ""}
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "model_input_features_col": "model_features",  # @param {"label": "model_input_features_col", "type": "string", "required": "false", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "false", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "false", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "false","range":"(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required":