
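7.5 Preset Operator Descriptions

7.5.3 Model Engineering

7.5.3.1 Classification

7.5.3.3.2 Clustering Evaluation

Overview

Evaluates the result dataset produced by a clustering model's predictions.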

Input

Parameter    Sub-parameter    Description
inputs       dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
_col         -                Name of the prediction column in the prediction result dataset. Defaults to "prediction".

Example

inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "features_col": "model_features",  # @param {"label": "features_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction"  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
}
cluster_evaluation____id___ = MLSClusterEvaluation(**params)
cluster_evaluation____id___.run()
# @output {"label":"dataframe","name":"cluster_evaluation____id___.get_outputs()['output_port_1']","type":"DataFrame"}
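The source does not state which metrics the clustering evaluation operator reports. One widely used metric that needs only a feature column and a prediction column is the silhouette coefficient, so the sketch below computes it in plain Python as an illustration. The function name, the use of Euclidean distance, and the toy data are all assumptions, not part of the operator.

```python
import math


def silhouette_score(features, predictions):
    """Mean silhouette coefficient (Euclidean distance) over all points."""
    # Group point indices by their predicted cluster label.
    clusters = {}
    for idx, label in enumerate(predictions):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(features[i], features[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(predictions):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        a = mean_distance(i, own)  # cohesion: mean distance inside own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy stand-ins for the "model_features" and "prediction" columns.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = silhouette_score(features, [0, 0, 1, 1])
bad = silhouette_score(features, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters; values below 0 suggest that points sit closer to another cluster than to their own.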

Input

Parameter    Sub-parameter     Description
inputs       dataframe         inputs is a dict; dataframe is a pyspark DataFrame object.
             pipeline_model    inputs is a dict; pipeline_model is a Spark pipeline model object.

inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None  # @input {"label":"pipeline_model","type":"PipelineModel"}
}
params = {
    "inputs": inputs
}
model_predict____id___ = MLSModelPredict(**params)
model_predict____id___.run()
# @output {"label":"dataframe","name":"model_predict____id___.get_outputs()['output_port_1']","type":"DataFrame"}
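Every sample in this section follows the same calling convention: build a params dict that contains an inputs dict, construct the operator with `**params`, call `run()`, and read the result from `get_outputs()['output_port_1']`. The sketch below mimics that convention with a hypothetical stand-in class, `EchoOperator`, to show how one operator's output can be wired into the next operator's inputs. It is not the real MLS implementation, and plain lists stand in for DataFrames.

```python
class EchoOperator:
    """Hypothetical stand-in that mimics the run()/get_outputs() convention."""

    def __init__(self, inputs, **params):
        self.inputs = inputs
        self.params = params
        self._outputs = {}

    def run(self):
        # A real operator would transform the DataFrame; this one passes it through.
        self._outputs["output_port_1"] = self.inputs["dataframe"]

    def get_outputs(self):
        return self._outputs


# First operator: receives the raw data.
first_params = {"inputs": {"dataframe": [("row-1",), ("row-2",)]}}
first = EchoOperator(**first_params)
first.run()

# Second operator: its input is the first operator's output port.
second_params = {"inputs": {"dataframe": first.get_outputs()["output_port_1"]}}
second = EchoOperator(**second_params)
second.run()

result = second.get_outputs()["output_port_1"]
```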

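Parameter Description

Parameter               Sub-parameter    Description
label_col               -                Target column.
prediction_index_col    -                Name of the predicted-label column in the input prediction dataset.
label_index_col         -                Name of the true-label column in the input prediction dataset.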

Example

inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "label_index_col": "label_index"  # @param {"label": "label_index_col", "type": "string", "required": "true", "helpTip": ""}
}
multi_class_evaluation____id___ = MLSMultiClassEvaluation(**params)
multi_class_evaluation____id___.run()
# @output {"label":"dataframe","name":"multi_class_evaluation____id___.get_outputs()['output_port_1']","type":"DataFrame"}
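The source does not list the metrics this operator produces. As a rough illustration of what can be derived from a true-label index column and a predicted-label index column, the sketch below computes accuracy and per-class precision and recall in plain Python. The function name and the row layout (a list of dicts) are assumptions made for the example.

```python
from collections import Counter


def multiclass_metrics(rows, label_index_col="label_index",
                       prediction_index_col="prediction_index"):
    """Return (accuracy, {class: {"precision": ..., "recall": ...}})."""
    pairs = [(row[label_index_col], row[prediction_index_col]) for row in rows]
    confusion = Counter(pairs)  # (true, predicted) -> count
    classes = sorted({value for pair in pairs for value in pair})

    accuracy = sum(confusion[(c, c)] for c in classes) / len(pairs)
    per_class = {}
    for c in classes:
        true_pos = confusion[(c, c)]
        predicted = sum(confusion[(other, c)] for other in classes)
        actual = sum(confusion[(c, other)] for other in classes)
        per_class[c] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return accuracy, per_class


rows = [
    {"label_index": 0, "prediction_index": 0},
    {"label_index": 0, "prediction_index": 1},
    {"label_index": 1, "prediction_index": 1},
    {"label_index": 2, "prediction_index": 2},
]
accuracy, per_class = multiclass_metrics(rows)
```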

inputs    dataframe    inputs is a dict; dataframe is a pyspark DataFrame object.

Output

Parameter         Sub-parameter    Description
prediction_col    -                Name of the prediction column in the prediction result dataset.

Example

inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction"  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
}
regression_evaluation____id___ = MLSRegressionEvaluation(**params)
regression_evaluation____id___.run()
# @output {"label":"dataframe","name":"regression_evaluation____id___.get_outputs()['output_port_1']","type":"DataFrame"}
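The source does not say which metrics the regression evaluation operator reports. The sketch below shows three metrics commonly computed from a label column and a prediction column (RMSE, MAE, and R²) in plain Python. The function name and the sample values are assumptions for illustration only.

```python
import math


def regression_metrics(labels, predictions):
    """Compute RMSE, MAE, and R^2 from paired labels and predictions."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    total_ss = sum((y - mean_label) ** 2 for y in labels)
    # R^2 is undefined when all labels are identical.
    r2 = 1.0 - (mse * n) / total_ss if total_ss else float("nan")
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": r2}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```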

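inputs    dataframe    inputs is a dict; dataframe is a pyspark DataFrame object.

Output

Parameter              Sub-parameter    Description
rating_col             -                Name of the column that holds the ratings.
recommend_nums         -                Number of items to recommend. Defaults to 10.
prediction_col         -                Name of the prediction column. Defaults to "prediction".
cold_start_strategy    -                Cold-start strategy. Defaults to "nan".
alpha                  -                Regularization coefficient for matrix factorization. Defaults to 1.0.
implicit_prefs         -                Whether to use implicit preferences. Defaults to False.
max_iter               -                Maximum number of iterations. Defaults to 50.
non_negative           -                Whether to apply a non-negativity constraint. Defaults to False.
rank                   -                Rank of the factorization. Defaults to 10.
reg_param              -                Regularization coefficient. Defaults to 0.0.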

Example

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "user_col": "",  # @param {"label":"user_col","type":"string","required":"true","helpTip":""}
    "item_col": "",  # @param {"label":"item_col","type":"string","required":"true","helpTip":""}
    "rating_col": "",  # @param {"label":"rating_col","type":"string","required":"true","helpTip":""}
    "recommend_nums": 10,  # @param {"label":"recommend_nums","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "cold_start_strategy": "nan",  # @param {"label":"cold_start_strategy","type":"string","required":"false","helpTip":""}
    "alpha": 1,  # @param {"label":"alpha","type":"number","required":"false","range":"(none,none)","helpTip":""}
    "implicit_prefs": False,  # @param {"label":"implicit_prefs","type":"boolean","required":"false","helpTip":""}
    "max_iter": 10,  # @param {"label":"max_iter","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "non_negative": False,  # @param {"label":"non_negative","type":"boolean","required":"false","helpTip":""}
    "rank": 10,  # @param {"label":"rank","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "reg_param": 0.1  # @param {"label":"reg_param","type":"number","required":"false","range":"(none,none)","helpTip":""}
}
als____id___ = MLSALS(**params)
als____id___.run()
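To make the roles of rank and recommend_nums more concrete, the sketch below assumes that a matrix-factorization model has already been trained and yields one factor vector per user and per item, each of length rank. A predicted score is then the dot product of the two vectors, and the recommendation list is the top recommend_nums items by score. The function name and the factor values are hypothetical, and the training step itself is not shown.

```python
def recommend_top_n(user_factors, item_factors, recommend_nums=10):
    """Rank items for every user by the dot product of their factor vectors."""
    recommendations = {}
    for user, u_vec in user_factors.items():
        scored = [
            (item, sum(u * v for u, v in zip(u_vec, i_vec)))
            for item, i_vec in item_factors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        recommendations[user] = scored[:recommend_nums]
    return recommendations


# Hypothetical factors with rank = 2.
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]}

top = recommend_top_n(user_factors, item_factors, recommend_nums=2)
```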

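7.5.3.5.1 Decision Tree Regression

Overview

The "Decision Tree Regression" node produces a regression model.

The decision tree algorithm builds the tree recursively: at each step it selects the feature split that minimizes the squared error, which yields a binary tree. The squared error is computed as:

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2$$

where $\mu$ is the mean of the sample labels, $y_i$ is the label of sample $i$, and $N$ is the number of samples.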
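The minimum-squared-error split selection described above can be sketched for a single numeric feature as follows. This is a simplified illustration that assumes the impurity is the label variance (squared deviations from the mean, divided by the sample count) and that candidate thresholds are midpoints between adjacent feature values. The function names are made up for the example, and binning (max_bins) is ignored.

```python
def variance(labels):
    """Mean squared deviation of the labels from their mean."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels) / len(labels)


def best_split(values, labels):
    """Return (threshold, weighted_child_variance) for the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, variance(labels))  # no split: impurity of the parent node
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        if weighted < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, weighted)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, impurity = best_split(values, labels)
```

A full tree applies the same search recursively to each child node until a stopping rule such as max_depth or min_instances_per_node is reached.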

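Input

Parameter    Sub-parameter    Description
inputs       dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.

Output

A Spark pipeline model.

Parameter Description

Parameter                Sub-parameter    Description
b_use_default_encoder    -                Whether to use the default encoding. Defaults to True.
input_features_str       -                Input column names joined by commas into a single string, for example "column_a" or "column_a,column_b".
min_info_gain            -                Minimum information gain. Defaults to 0.0.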
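As a small illustration of the input_features_str format, the helper below splits such a string back into a list of column names. The helper name is hypothetical, and trimming whitespace and ignoring empty entries are assumptions; the source only shows the comma-separated form.

```python
def parse_input_features(input_features_str):
    """Split "column_a,column_b" into ["column_a", "column_b"]."""
    return [name.strip() for name in input_features_str.split(",") if name.strip()]


single = parse_input_features("column_a")
several = parse_input_features("column_a,column_b")
empty = parse_input_features("")
```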

Example

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
    "impurity": "variance"
}
dt_regressor____id___ = MLSDecisionTreeRegression(**params)
dt_regressor____id___.run()
# @output {"label":"pipeline_model","name":"dt_regressor____id___.get_outputs()['output_port_1']","type":"PipelineModel"}

Input

Parameter    Sub-parameter    Description
inputs       dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.

Output

A Spark pipeline model.

Parameter Description

Parameter                       Sub-parameter    Description
b_use_default_encoder           -                Whether to use the default encoding. Defaults to True.
input_features_str              -                Input column names joined by commas into a single string, for example "column_a" or "column_a,column_b".
label_col                       -                Target column.
regressor_feature_vector_col    -                Name of the feature-vector column the operator takes as input. Defaults to "model_features".
max_depth                       -                Maximum depth of the tree. Defaults to 5.
max_bins                        -                Maximum number of bins. Defaults to 32.
min_instances_per_node          -                Minimum number of instances each child node must contain when a node is split. Defaults to 1.
min_info_gain                   -                Minimum information gain required for a node to be split. Defaults to 0.0.
subsampling_rate                -                Fraction of the training set sampled to learn each decision tree. Defaults to 1.0.
loss_type                       -                Loss function type. Supported values are squared and absolute. Defaults to "squared".
max_iter                        -                Maximum number of iterations. Defaults to 20.

params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "true", "range": "(0.0,1.0]", "helpTip": ""}
    "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "true", "options": "squared,absolute", "helpTip": ""}
    "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0.0,none)", "helpTip": ""}
    "impurity": "variance"
}
gbt_regressor____id___ = MLSGBTRegression(**params)
gbt_regressor____id___.run()
# @output {"label":"pipeline_model","name":"gbt_regressor____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
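To give a feel for how max_iter, step_size, and the squared loss interact in gradient-boosted trees, the sketch below implements a generic textbook version with depth-1 trees (stumps) on a single feature. It is not the algorithm used by MLSGBTRegression or by Spark; the function names, the mean-label starting point, and the toy data are assumptions, and the feature needs at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, max_iter=20, step_size=0.1):
    """Boost stumps on the residuals of the squared loss."""
    base = sum(ys) / len(ys)  # start from the mean label
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(max_iter):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + step_size * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + step_size * sum(s(x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
model = fit_gbt(xs, ys, max_iter=20, step_size=0.1)

baseline_mse = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
model_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

A smaller step_size makes each tree contribute less, so more iterations are needed to reach the same training error.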

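7.5.3.5.3 LightGBM Regression

Overview

A wrapper around LightGBM regression from the mmlspark Python package.

Input

Parameter    Sub-parameter    Description
inputs       dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.

Output

A Spark pipeline model.

Parameter                       Sub-parameter    Description
label_col                       -                Target column.
regressor_feature_vector_col    -                Name of the feature-vector column the operator takes as input. Defaults to "model_features".
prediction_col                  -                Name of the predicted-label column the operator outputs. Defaults to "prediction".
objective                       -                Objective function. Defaults to "regression".
max_depth                       -                Maximum depth of the tree. Defaults to -1.
num_iteration                   -                Number of iterations. Defaults to 100.
learning_rate                   -                Learning rate. Defaults to 0.1.
num_leaves                      -                Number of leaves. Defaults to 31.
max_bin                         -                Maximum number of bins. Defaults to 255.
bagging_fraction                -                Bagging fraction. Defaults to 1.
bagging_freq                    -                Bagging frequency. Defaults to 0.
bagging_seed                    -                Random seed used for bagging. Defaults to 3.
early_stopping_round            -                Number of rounds after which iteration stops early. Defaults to 0.
feature_fraction                -                Fraction of features used. Defaults to 1.0.
min_sum_hessian_in_leaf         -                Minimum sum of Hessians in one leaf. Value range is [0, 1]. Defaults to 1e-3.
boost_from_average              -                Whether to adjust the initial score to the mean of the labels to speed up convergence. Defaults to True.
boosting_type                   -                Boosting type. Options are gbdt, gbrt, rf, dart, and goss. Defaults to gbdt.
lambda_l1                       -                L1 regularization coefficient. Defaults to 0.0.
lambda_l2                       -                L2 regularization coefficient. Defaults to 0.0.
num_batches                     -                If greater than 0, the dataset is split into separate batches during training. Defaults to 0.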

params = {
    "inputs": inputs,
    "b_output_action": True,
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "regressor_feature_vector_col": "model_features",  # @param {"label":"regressor_feature_vector_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "objective": "regression",  # @param {"label":"objective","type":"string","required":"false","helpTip":""}
    "max_depth": -1,  # @param {"label":"max_depth","type":"integer","required":"false","range":"[-1,2147483647]","helpTip":""}
    "num_iteration": 100,  # @param {"label":"num_iteration","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "learning_rate": 0.1,  # @param {"label":"learning_rate","type":"number","required":"false","helpTip":""}
    "num_leaves": 31,  # @param
    "early_stopping_round": 0,  # @param {"label":"early_stopping_round","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "feature_fraction": 1.0,  # @param {"label":"feature_fraction","type":"number","required":"false","helpTip":""}
    "min_sum_hessian_in_leaf": 1e-3,  # @param {"label":"min_sum_hessian_in_leaf","type":"number","required":"false","helpTip":""}
    "boost_from_average": True,  # @param {"label":"boost_from_average","type":"boolean","required":"false","helpTip":""}
    "boosting_type": "gbdt",  # @param {"label":"boosting_type","type":"string","required":"false","helpTip":""}
    "lambda_l1": 0.0,  # @param {"label":"lambda_l1","type":"number","required":"false","helpTip":""}
    "lambda_l2": 0.0,  # @param {"label":"lambda_l2","type":"number","required":"false","helpTip":""}
    "num_batches": 0,  # @param {"label":"num_batches","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "parallelism": "data_parallel"  # @param {"label":"parallelism","type":"string","required":"false","helpTip":""}
}
lightgbm_regressor____id___ = MLSLightGbmRegression(**params)
lightgbm_regressor____id___.run()
# @output {"label":"pipeline_model","name":"lightgbm_regressor____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
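The `@param` annotations in these samples carry a "range" string in interval notation, such as "(0,2147483647]", "[-1,2147483647]", or "[0.0,none)". The sketch below shows one way such a string could be checked before a value is passed to an operator. It assumes the usual mathematical reading (round bracket excludes the bound, square bracket includes it) and treats "none" as unbounded; both readings are assumptions, and the helper names are hypothetical.

```python
import math


def parse_range(range_str):
    """Parse "(0,2147483647]" into (low, high, low_closed, high_closed)."""
    low_closed = range_str[0] == "["
    high_closed = range_str[-1] == "]"
    low_text, high_text = (part.strip() for part in range_str[1:-1].split(","))
    low = -math.inf if low_text == "none" else float(low_text)
    high = math.inf if high_text == "none" else float(high_text)
    return low, high, low_closed, high_closed


def in_range(value, range_str):
    """Return True when value lies inside the annotated interval."""
    low, high, low_closed, high_closed = parse_range(range_str)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below


max_depth_ok = in_range(-1, "[-1,2147483647]")      # default max_depth of -1
num_iteration_ok = in_range(0, "(0,2147483647]")    # 0 iterations is excluded
```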

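7.5.3.5.4 Linear Regression

Overview

The "Linear Regression" node produces a linear regression model. Linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative interdependence between two or more variables.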
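As a minimal illustration of the idea, the sketch below fits a straight line to one feature by ordinary least squares in plain Python. It covers only the single-feature, unregularized case; the function name and the data are made up for the example, and the operator itself may fit multiple features and apply regularization.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```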